Preface



We are delighted to present the proceedings of DAGM 2004, and wish to ex- 
press our gratitude to the many people whose efforts made the success of the 
conference possible. We received 146 contributions of which we were able to ac- 
cept 22 as oral presentations and 48 as posters. Each paper received 3 reviews, 
upon which decisions were based. We are grateful for the dedicated work of the 
38 members of the program committee and the numerous referees. The careful 
review process led to the exciting program which we are able to present in this 
volume. 

Among the highlights of the meeting were the talks of our four invited speak- 
ers, renowned experts in areas spanning learning in theory, in vision and in 
robotics: 

— William T. Freeman, Artificial Intelligence Laboratory, MIT: Sharing Fea- 
tures for Multi-class Object Detection 

— Pietro Perona, Caltech: Towards Unsupervised Learning of Object Categories 

— Stefan Schaal, Department of Computer Science, University of Southern Cal- 
ifornia: Real-Time Statistical Learning for Humanoid Robotics 

— Vladimir Vapnik, NEC Research Institute: Empirical Inference 

We are grateful for economic support from Honda Research Institute Europe,
ABW GmbH, Transtec AG, DaimlerChrysler, and Stemmer Imaging GmbH,
which enabled us to finance best paper prizes and a limited number of travel
grants. Many thanks to our local support, Sabrina Nielebock and Dagmar Maier,
who dealt with the unimaginably diverse range of practical tasks involved in
planning a DAGM symposium. Thanks to Richard van de Stadt for providing
excellent software and support for handling the reviewing process. Special
thanks go to Jeremy Hill, who wrote and maintained the conference website.
Without all of your dedicated contributions, the successful 26th DAGM
Symposium in Tübingen would not have been possible.
Abstract. We present an approach to discretizing multivariate continuous
data while learning the structure of a graphical model. We derive a
joint scoring function from the principle of predictive accuracy, which in- 
herently ensures the optimal trade-off between goodness of fit and model 
complexity including the number of discretization levels. Using the so- 
called finest grid implied by the data, our scoring function depends only 
on the number of data points in the various discretization levels (inde- 
pendent of the metric used in the continuous space). Our experiments 
with artificial data as well as with gene expression data show that dis- 
cretization plays a crucial role regarding the resulting network structure. 

1 Introduction 

Continuous data is often discretized as part of a more advanced approach to data 
analysis such as learning graphical models. Discretization may be carried out 
merely for computational efficiency, or because background knowledge suggests 
that the underlying variables are indeed discrete. While it is computationally 
efficient to discretize the data in a preprocessing step that is independent of the 
subsequent analysis [6,10,7], the impact of the discretization policy on the subse- 
quent analysis is often unclear in this approach. Existing methods that optimize 
the discretization policy jointly with the graph structure [3,9] are computation- 
ally very involved and therefore not directly suitable for large domains. 

We present a novel and more efficient scoring function for joint optimiza- 
tion of the discretization policy and the model structure. The objective relies 
on predictive accuracy, where predictive accuracy is assessed sequentially as in 
prequential validation [2] or stochastic complexity [12]. 

2 Sequential Approach 

Let $Y = (Y_1, \ldots, Y_k, \ldots, Y_n)$ denote a vector of $n$ continuous
variables in the domain of interest, and $y$ any specific instantiation of these
variables. The discretization of $Y$ is determined by a discretization policy
$\Lambda = (\Lambda_1, \ldots, \Lambda_n)$: for each variable $Y_k$, let
$\Lambda_k = (\lambda_{k,1}, \ldots, \lambda_{k,r_k-1})$ be ordered threshold
values, and $r_k$ be the number of discretization levels. This determines the
mapping $f_\Lambda : Y \mapsto X$, where $X = (X_1, \ldots, X_k, \ldots, X_n)$ is
the corresponding discretized vector; for efficiency reasons we only consider
deterministic discretizations, where each continuous value $y_k$ is mapped to
exactly one discretization level, $x_k = f_{\Lambda_k}(y_k)$.

C.E. Rasmussen et al. (Eds.): DAGM 2004, LNCS 3175, pp. 1-8, 2004.

(c) Springer-Verlag Berlin Heidelberg 2004

We pretend that (continuous) i.i.d. data $D$ arrive in a sequential manner,
and then assess predictive accuracy regarding the data points along the sequence.
This is similar to prequential validation or stochastic complexity [2,12]. We recast
the joint marginal likelihood of the discretization policy $\Lambda$ and the
structure $m$ of a graphical model in a sequential manner,

$$p(D \mid \Lambda, m) = \prod_{i=1}^{N} p(y^{(i)} \mid D^{(i-1)}, \Lambda, m),$$

where $D^{(i-1)} = (y^{(i-1)}, \ldots, y^{(1)})$ denotes the data points seen prior
to step $i$ along the sequence.
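
The sequential factorization can be illustrated in the simplest discrete setting. The following sketch is not the scoring function of this paper: it assumes a single discrete variable with a symmetric Dirichlet prior (the function names and the prior strength `alpha` are illustrative choices), and it checks numerically that the sum of one-step-ahead log predictive probabilities equals the closed-form log marginal likelihood.

```python
import math
from collections import Counter

def sequential_log_likelihood(xs, num_states, alpha=1.0):
    """Sum of log p(x_i | x_1, ..., x_{i-1}) under a symmetric Dirichlet prior."""
    counts = Counter()
    total = 0.0
    for i, x in enumerate(xs):
        # The prediction at step i uses only the i points seen so far.
        prob = (counts[x] + alpha) / (i + num_states * alpha)
        total += math.log(prob)
        counts[x] += 1
    return total

def closed_form_log_marginal(xs, num_states, alpha=1.0):
    """Dirichlet-multinomial log marginal likelihood of the whole sequence."""
    counts = Counter(xs)
    a0 = num_states * alpha
    result = math.lgamma(a0) - math.lgamma(a0 + len(xs))
    for state in range(num_states):
        result += math.lgamma(alpha + counts[state]) - math.lgamma(alpha)
    return result

data = [0, 2, 1, 0, 0, 2, 1, 1, 0, 2]  # made-up discrete observations
seq = sequential_log_likelihood(data, num_states=3)
closed = closed_form_log_marginal(data, num_states=3)
```

Because the product of predictive terms telescopes into the marginal likelihood, the value does not depend on the order in which the i.i.d. points are presented.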

For deterministic discretization we can assume that at each step $i$ the
predicted density regarding data point $y^{(i)}$ factors according to
$p(y^{(i)} \mid D^{(i-1)}, \Lambda, m) = p(y^{(i)} \mid x^{(i)}, \Lambda)\, p(x^{(i)} \mid D^{(i-1)}, m, \Lambda)$,
where $x^{(i)} = f_\Lambda(y^{(i)})$. It is desirable that the structure $m$ indeed
captures all the relevant (conditional) dependences among the variables
$Y_1, \ldots, Y_n$. Assuming that the dependences among the continuous $Y_k$ are
described by the discretized distribution $p(X \mid m, \Lambda, D)$, any two
continuous variables $Y_k$ and $Y_{k'}$ are independent conditional on $X$:

$$p(y^{(i)} \mid x^{(i)}, \Lambda) = \prod_{k=1}^{n} p(y_k^{(i)} \mid x_k^{(i)}, \Lambda_k).$$

The computational feasibility of this approach depends crucially on the ef- 
ficiency of the mapping between the discrete and continuous spaces. A simple 
approach may use the same density to account for points y and y' that are 
mapped to the same discretized state x, cf. [9]. Assuming a uniform probabil- 
ity density is overly stringent and degrades the predictive accuracy; moreover, 
this might also give rise to "empty states", cf. [15]. In contrast, we require only
independence of the variables $Y_k$.

3 Finest Grid Implied by the Data 

The finest grid implied by the data is a simple mapping between Y and X that 
retains the desired independence properties with non-uniform densities, and can 
be computed efficiently. 

This grid is obtained by discretizing each variable $Y_k$ such that the
corresponding (new) discrete variable $Z_k$ has as many states as there are data
points, and exactly one data point is assigned to each of those states (an
extension to the case with identical data points is straightforward; also note
that this grid is not unique, as any threshold value between neighboring data
points can be chosen). Note that, in our predictive approach, this grid is based
on data $D^{(i-1)}$ at each step $i$.
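
A minimal sketch of one such grid for a single variable is given below. It assumes midpoints between neighboring sorted values as thresholds (one of the many valid choices mentioned above) and collapses identical values into one state; the function names are illustrative and not taken from the paper.

```python
import bisect

def finest_grid_thresholds(values):
    """Midpoints between neighbouring distinct values: one valid finest grid."""
    ordered = sorted(set(values))
    return [(lo + hi) / 2.0 for lo, hi in zip(ordered, ordered[1:])]

def grid_state(value, thresholds):
    """Index of the grid state that a continuous value falls into."""
    return bisect.bisect_right(thresholds, value)

points = [0.3, 2.5, 1.1, 7.0]  # made-up observations of one variable
thresholds = finest_grid_thresholds(points)
states = [grid_state(v, thresholds) for v in points]
```

With distinct values, the grid has as many states as data points and every state holds exactly one point.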

Based on this grid, we can now obtain an efficient mapping between $Y$ and
$X$ as follows: we assume that two points $y_k$ and $y'_k$ in the continuous space
get assigned the same density if they map to the same state of $Z_k$; and that two
states $z_k$ and $z'_k$ of $Z_k$ get assigned the same probability mass if they map
to the same discretization level of $X_k$ (we require that each state of $Z_k$ is
mapped to exactly one discretization level of $X_k$ for computational efficiency).
This immediately yields

$$p(y_k^{(i)} \mid x_k^{(i)}, \Lambda_k) = \frac{c}{N^{(i-1)}_{x_k^{(i)}}},$$

where $N^{(i-1)}_{x_k^{(i)}}$ denotes the number of data points in discretization
level $x_k^{(i)}$ of variable $X_k$ before step $i$ along the sequence
($N^{(i-1)}_{x_k^{(i)}} > 0$). The constant $c$ absorbs the mapping from $Z$ to $Y$
by means of the finest grid. Using the same grid for two models being compared, we
have the important property that $c$ is irrelevant for determining the optimal
$\Lambda$ and $m$. Details are omitted here for lack of space; see also [15].

4 Predictive Discretization 

In our sequential approach, the density at data point $y^{(i)}$ is predicted
strictly without hindsight at each step $i$, i.e., only data $D^{(i-1)}$ is used.
This leads to a fair assessment of predictive accuracy. Since i.i.d. data lack an
inherent sequential ordering, we may choose a particular ordering of the data
points. This is similar in spirit to stochastic complexity [12], where also a
particular sequential ordering is used. The basic idea is to choose an ordering
such that, for all $x_k$, we have $N^{(i-1)}_{x_k} > 0$ for all $i \geq i_0$, where
$i_0$ is minimal. The initial part of this sequence is thus negligible compared to
the part where $i = i_0, \ldots, N$ when the number of data points is considerably
larger than the number of discretization levels of any single variable,
$N \gg \max_k |X_k|$. Combining the above equations, we obtain the following
(approximate) predictive scoring function $C(\Lambda, m)$:

$$\log p(D \mid \Lambda, m) \approx C(\Lambda, m) + c' = \log p(D_\Lambda \mid m) - \log G(D, \Lambda) + c', \qquad (1)$$

where the approximation is due to ignoring the short initial part of the
sequence; $p(D_\Lambda \mid m)$ is the marginal likelihood of the graph $m$ in light
of the data $D_\Lambda$ discretized according to $\Lambda$. In a Bayesian approach,
it can be calculated easily for various graphical models, e.g., see [1,8]
concerning discrete Bayesian networks. The second term in Eq. 1 is given by

$$\log G(D, \Lambda) = \sum_{k=1}^{n} \sum_{x_k} \log \Gamma(N(x_k)),$$

where $\Gamma$ denotes the Gamma function, $\Gamma(N(x_k)) = [N(x_k) - 1]!$, and
$N(x_k)$ is the number of data points in discretization level $x_k$.¹ It is crucial
that the constant $c'$, which collects the constants $c$ from above, is irrelevant
for determining the optimal $\Lambda$ and $m$. For lack of space, we refer the
reader to [15] for further details.
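
The second term of Eq. 1 only needs the number of data points per discretization level, so it is cheap to evaluate. The sketch below assumes that each variable is given as a list of values and each policy entry as a sorted list of thresholds; the function names and the convention that a value equal to a threshold falls into the upper level are illustrative assumptions.

```python
import bisect
import math
from collections import Counter

def level_counts(values, thresholds):
    """Number of data points in each discretization level of one variable."""
    counts = Counter(bisect.bisect_right(thresholds, v) for v in values)
    return [counts[level] for level in range(len(thresholds) + 1)]

def log_g(data_columns, policy):
    """Sum over variables and levels of log Gamma(N(x_k))."""
    total = 0.0
    for values, thresholds in zip(data_columns, policy):
        for n in level_counts(values, thresholds):
            if n == 0:
                raise ValueError("empty discretization level")
            total += math.lgamma(n)
    return total

column = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # made-up data, one variable
two_levels = log_g([column], [[4.5]])  # counts 4 and 4: 2 * log(3!)
one_level = log_g([column], [[]])      # count 8: log(7!)
```

In this toy case the term grows from log 36 to log 5040 when the two levels are merged into one, which is the sense in which it penalizes small numbers of discretization levels once it is subtracted in Eq. 1.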

Our scoring function $C(\Lambda, m)$ has several interesting properties. First,
the difference between the two terms in Eq. 1 determines the trade-off dictating
the optimal number of discretization levels, threshold values and graph
structure. As both terms increase with a diminishing number of discretization
levels, the second term can be viewed as a penalty for small numbers of
discretization levels. Second, $C(\Lambda, m)$ depends only on the number of data
points in the different discretization levels. This is a consequence of the finest
grid implied by the data, and it has several implications. Most important from a
practical point of view, it renders efficient evaluation of the scoring function
possible. More interesting from a conceptual perspective, $C(\Lambda, m)$ is
independent of the particular choice of the finest grid. Moreover, $C(\Lambda, m)$
is independent of the metric in the continuous space, and thus invariant under
monotonic transformations of the continuous variables. Obviously, this can lead
to considerable loss of information, particularly when the (Euclidean) distances
among the various data points in the continuous space govern the discretization
(cf. left graph in Fig. 1). On the other hand, the results of our scoring function
are not degraded if the data is given w.r.t. an inappropriate metric. In fact, the
optimal discretization w.r.t. our scoring function is based on statistical
dependence of the variables, rather than on the metric. This is illustrated in our
toy experiments with artificial data, cf. Section 5. Finally, our approach
includes quantile discretization as a special case, namely when all the variables
are independent of each other.

¹ Note that $N(x_k) > 0$ is ensured in our approach, i.e., there are no "empty
states" [15].
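
The invariance under monotonic transformations can be checked on a toy example. The sketch below assumes distinct data values and equal-frequency (quantile) thresholds placed halfway between neighboring order statistics; the helper names and the data are made up for illustration. Since a score of this kind depends on the data only through the level counts, identical counts before and after the transformation imply an identical score.

```python
import bisect
import math

def quantile_thresholds(values, num_levels):
    """Thresholds halfway between order statistics at equal-frequency cuts."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [round(j * n / num_levels) for j in range(1, num_levels)]
    return [(ordered[c - 1] + ordered[c]) / 2.0 for c in cuts]

def level_counts(values, thresholds):
    """Number of data points per discretization level."""
    counts = [0] * (len(thresholds) + 1)
    for v in values:
        counts[bisect.bisect_right(thresholds, v)] += 1
    return counts

raw = [0.2, 0.5, 0.9, 1.4, 2.0, 3.1, 4.7, 9.8, 15.0]
transformed = [math.log(v) for v in raw]  # a monotonic transformation
counts_raw = level_counts(raw, quantile_thresholds(raw, 3))
counts_log = level_counts(transformed, quantile_thresholds(transformed, 3))
```

The thresholds themselves differ between the two spaces, but they separate the same data points, which is all that matters here.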

5 Experiments 

In our first two experiments, we show that our approach discretizes the data
based on statistical dependence rather than on the metric in the continuous
space. Consider the left two panels in Fig. 1: when the variables are
independent, our approach may not find the discretization suggested by the
clusters; instead, our approach assigns the same number of data points to each
discretization level (with one discretization level being optimal). Note that
discretization of independent variables is, however, quite irrelevant when
learning graphical models: the optimal discretization of each variable $Y_k$
depends on the variables in its Markov blanket, and $Y_k$ is (typically strongly)
dependent on those variables. When the variables are dependent in Fig. 1, our
scoring function favours the "correct" discretization (solid lines), as this
entails best predictive accuracy (even when disregarding the metric). However,
dependence of the variables itself does not necessarily ensure that our scoring
function favours the "correct" discretization, as illustrated in the right two
panels in Fig. 1 (as a constraint, we require two discretization levels): given
low noise levels, our scoring function assigns the same number of data points to
each discretization level; however, a sufficiently high noise level in the data
can actually be beneficial, permitting our approach to find the "correct"
discretization, cf. Fig. 1 (right).

Fig. 1. Left two panels: each cluster comprises 100 points sampled from a Gaussian
distribution; $Y_0$ and $Y_1$ are independent on the left, and dependent on the
right. Right two panels: when $Y_0$ and $Y_1$ are dependent, noise may help in
finding the "correct" discretization.

Our third experiment demonstrates that our scoring function favours less
complex models (i.e., sparser graphs and fewer discretization levels) when given
smaller data sets. This is desirable in order to avoid overfitting when learning
from small samples, leading to optimal predictive accuracy. We considered a pair
of normally distributed random variables $Y_0$ and $Y_1$ with correlation
coefficient $\mathrm{corr}(Y_0, Y_1) = 1/\sqrt{2}$. Note that this distribution
does not imply a "natural" number of discretization levels; due to the dependence
of $Y_0$ and $Y_1$ one may hence expect the learned number of discretization
levels to rise with growing sample size. Indeed, Fig. 2 shows exactly this
behavior. Moreover, the learned graph structure implies independence of $Y_0$ and
$Y_1$ when given very small samples (fewer than 30 data points in our experiment),
while $Y_0$ and $Y_1$ are found to be dependent for all larger sample sizes.

In our fourth experiment, we were concerned with gene expression data. In 
computational biology, regulatory networks are often modeled by Bayesian net- 
works, and their structures are learned from discretized gene-expression data, 
see, e.g., [6,11,7]. Obviously, one would like to recover the "true" network
structure underlying the continuous data, rather than a degraded network
structure due to a suboptimal discretization policy. Typically, the expression
levels have been discretized in a preprocessing step, rather than jointly with the
network structure [6,11,7]. In our experiment, we employed our predictive scoring
function (cf. Eq. 1) and re-analyzed the gene expression data concerning the 
pheromone response pathway in yeast [7], comprising 320 measurements con- 
cerning 32 continuous variables (genes) as well as the mating type (binary vari- 
able). Based on an error model concerning the micro-array measurements, a 
continuously differentiable, monotonic transformation is typically applied to the 
raw gene expression data in a preprocessing step. Since our predictive scoring 
function is invariant under this kind of transformation, this has no impact on 
our analysis, so that we are able to work directly with the raw data. 

Instead of using a search strategy in the joint space of graphs and
discretization policies (the theoretically best, but computationally most
involved approach), we optimize the graph $m$ and the discretization policy
$\Lambda$ alternately in a greedy way for simplicity: given the discretized data
$D_\Lambda$, we use local search to optimize the graph $m$, as in [8]; given $m$,
we optimize $\Lambda$ iteratively by improving the discretization policy regarding
a single variable given its Markov blanket at a time. The latter optimization is
carried out in a hierarchical way over the number of discretization levels and
over the threshold values of each variable. Local maxima are a major issue when
optimizing the predictive scoring function due to the (strong) interdependence
between $m$ and $\Lambda$. As a simple heuristic, we alternately optimize
$\Lambda$ and $m$ only slightly at each step.

Fig. 2. The number of discretization levels (mean and standard deviation, averaged
over 10 samples of each size) depends on the sample size (cf. text for details).

The marginal likelihood $p(D_\Lambda \mid m)$, which is part of our scoring
function, contains a free parameter, namely the so-called scale-parameter
$\alpha$ regarding the Dirichlet prior over the model parameters, e.g., cf. [8].
As outlined in [13], its value has a decisive impact on the resulting number of
edges in the network, and must hence be chosen with great care. Assessing
predictive accuracy by means of 5-fold cross validation, we determined
$\alpha \approx 25$.

Fig. 3 shows the composite graph we learned from the gene expression data,
employing our predictive scoring function, cf. Eq. 1. This graph is compiled by
averaging over several Bayesian network structures in order to account for model
uncertainty prevailing in the small data set. Instead of exploring model
uncertainty by means of Markov Chain Monte Carlo in the model space, we used a
non-parametric re-sampling method, as the latter is independent of any model
assumptions. While the bootstrap has been used in [5,4,6,11], we prefer the
jackknife when learning the graph structure, i.e., conditional independences. The
reason is that the bootstrap procedure can easily induce spurious dependencies
when given a small data set $D$; as a consequence, the resulting network structure
can be considerably biased towards denser graphs [14]. The jackknife avoids this
problem. We obtained very similar results using three different variants of the
jackknife: delete-1, delete-30, and delete-64. Averaging over 320 delete-30
jackknife sub-samples, we found 65.7 ± 8 edges. Fig. 3 displays 65 edges: the
solid ones are present with probability > 50%, and the dashed ones with
probability > 34%. The orientation of an edge is indicated only if one direction
is at least twice as likely as the contrary one. Apart from that, our predictive
scoring function yielded that most of the variables have about 4 discretization
levels (on average over the 320 jackknife samples), except for the genes MCM1,
MFALPHA1, KSS1, STE5, STE11, STE20, STE50, SWI1, TUP1 with about 3 states, and the
genes BAR1, MFA1, MFA2, STE2, STE6 with ca. 5 states.
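
The re-sampling step can be sketched as follows. This is only an illustration of delete-d jackknife averaging under stated assumptions: sub-samples are drawn at random, `learn_edges` stands for an arbitrary structure learner that returns a set of edges, and `toy_learner` is a made-up placeholder rather than the procedure used in our experiments.

```python
import random
from collections import Counter

def delete_d_subsamples(data, d, num_subsamples, seed=0):
    """Yield sub-samples obtained by deleting d randomly chosen data points."""
    rng = random.Random(seed)
    n = len(data)
    for _ in range(num_subsamples):
        removed = set(rng.sample(range(n), d))
        yield [row for idx, row in enumerate(data) if idx not in removed]

def edge_frequencies(data, learn_edges, d, num_subsamples, seed=0):
    """Fraction of sub-samples in which each edge is learned."""
    tally = Counter()
    for subsample in delete_d_subsamples(data, d, num_subsamples, seed):
        tally.update(learn_edges(subsample))
    return {edge: count / num_subsamples for edge, count in tally.items()}

def toy_learner(rows):
    """Placeholder learner: report an edge A-B if both columns mostly agree."""
    agree = sum(1 for a, b in rows if a == b) / len(rows)
    return {("A", "B")} if agree > 0.8 else set()

toy_data = [(i % 2, i % 2) for i in range(20)]  # two perfectly coupled columns
frequencies = edge_frequencies(toy_data, toy_learner, d=3, num_subsamples=10)
```

Edges would then be drawn in a composite graph according to thresholds on these frequencies.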

In Fig. 3, it is apparent that the genes AGA2, BAR1, MFA1, MFA2, STE2,
and STE6 (magenta) are densely interconnected, and so is the group of genes
MFALPHA1, MFALPHA2, SAG1 and STE3 (red). Moreover, both of those
groups are directly connected to the mating type, while the other genes in the
network are (marginally) independent of the mating type. This makes sense
from a biological perspective, as the former genes (magenta) are only expressed
in yeast cells of mating type A, while the latter ones (red) are only expressed in
mating type ALPHA; the expression level of the other genes is rather unaffected
by the mating type. Due to lack of space, a more detailed (biological) discussion
has to be omitted here.

Fig. 3. This graph is compiled from 320 delete-30 jackknife samples (cf. [7] for the
color-coding).

² We imposed no constraints on the network structure in Fig. 3. Unfortunately, the
results we obtained when imposing constraints derived from location data have to
be skipped due to lack of space.

Indeed, this grouping of the genes is also supported when considering
correlations as a measure of statistical dependence:³ we find that the absolute value
of the correlations between the mating type and each gene in either group from 
above is larger than 0.38, while any other gene is only weakly correlated with 
the mating type, namely less than 0.18 in absolute value. 
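
For reference, the correlation between a binary variable and a continuous one is the ordinary Pearson coefficient, as in the short sketch below. The numbers are made up for illustration and are not taken from the yeast data.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

mating_type = [0, 0, 1, 1]         # made-up binary labels
expression = [1.0, 2.0, 3.0, 4.0]  # made-up expression levels
r = pearson_correlation(mating_type, expression)
```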

The crucial impact of the discretization policy $\Lambda$ and the scale-parameter
$\alpha$ on the resulting network structure becomes apparent when our results are
compared to the ones reported in [7]: their network structure resembles a naive
Bayesian network, where the mating type is the root variable. Obviously, their
network structure is notably different from ours in Fig. 3, and hence has very
different (biological) implications. Unlike in [7], we have optimized the
discretization policy $\Lambda$ and the network structure $m$ jointly, as well as
the scale-parameter $\alpha$. As the value of the scale-parameter $\alpha$ mainly
affects the number of edges present in the learned graph [13], this suggests that
the major differences in the obtained network structures are actually due to the
discretization policy.



6 Conclusions 

We have derived a principled yet efficient method for determining the resolution
at which to represent continuous observations. Our discretization approach relies
on predictive accuracy in the prequential sense and employs the so-called finest
grid implied by the data as the basis for finding the appropriate levels. Our
experiments show that a suboptimal discretization method can easily degrade
the obtained results, which highlights the importance of the principled approach
we have proposed.

³ Note that correlations are applicable here, even though they measure only linear
effects. This is because the mating type is a binary variable.
Abstract. Most image segmentation algorithms optimize some mathematical
similarity criterion derived from several low-level image features. One possible 
way of combining different types of features, e.g. color- and texture features on 
different scales and/or different orientations, is to simply stack all the individ- 
ual measurements into one high-dimensional feature vector. Due to the nature of 
such stacked vectors, however, only very few components (e.g. those which are 
defined on a suitable scale) will carry information that is relevant for the actual 
segmentation task. We present an approach to combining segmentation and adap- 
tive feature selection that overcomes this relevance determination problem. All 
free model parameters of this method are selected by a resampling-based stability 
analysis. Experiments demonstrate that the built-in feature selection mechanism 
leads to stable and meaningful partitions of the images. 



1 Introduction 

The goal of image segmentation is to divide an image into connected regions that are 
meant to be semantic equivalence classes. In most practical approaches, however, the 
semantic interpretation of segments is not modeled explicitly. It is, rather, modeled 
indirectly by assuming that semantic similarity corresponds with some mathematical 
similarity criterion derived from several low-level image features. Following this line 
of building segmentation algorithms, the question of how to combine different types of 
features naturally arises. One popular solution is to simply stack all different features 
into a high-dimensional vector, see e.g. [1]. The individual components of such a feature
vector may e.g. consist of color frequencies on different scales and also of texture
features both on different scales and different orientations. The task of grouping such 
high-dimensional vectors, however, typically poses two different types of problems: on 
the technical side, most grouping algorithms become increasingly unstable with growing
input space dimension. Since for most relevant grouping criteria no efficient globally 
optimal optimization algorithms are known, this “curse of dimensionality” problem is 
usually related to the steep increase of local minima of the objective functions. Apart 
from this technical viewpoint, the special structure of feature vectors that arise from 
stacking several types of features poses another problem which is related to the rele- 
vance of features for solving the actual segmentation task. For instance, texture features 
on one particular scale and orientation might be highly relevant for segmenting a textile 
pattern from an unstructured background, while most other feature dimensions will basically
contain useless “noise” with respect to this particular task. Treating all features
equally, we cannot expect to find a reliable decomposition of the image into meaningful
classes. Whereas the “curse of dimensionality” problem might be overcome by
using a general regularization procedure which restricts the intrinsic complexity of the 
learning algorithm used for partitioning the image, the special nature of stacked feature 
vectors particularly emphasizes the need for an adaptive feature selection or relevance 
determination mechanism. 

In supervised learning scenarios, feature selection has been studied widely in the 
literature. Selecting features in unsupervised partitioning scenarios, however, is a much 
harder problem, due to the absence of class labels that would guide the search for rel- 
evant information. Problems of this kind have been rarely studied in the literature, for 
exceptions see e.g. [2,9,15]. The common strategy of most approaches is the use of an 
iterated stepwise procedure: in the first step a set of hypothetical partitions is extracted 
(the clustering step), and in the second step features are scored for relevance (the rele- 
vance determination step). A possible shortcoming is the way of combining these two 
steps in an “ad hoc” manner: firstly, standard relevance determination mechanisms do not
take into account the properties of the clustering method used. Secondly, most scoring 
methods make an implicit independence assumption, ignoring feature correlations. It is 
thus of particular interest to combine feature selection and partitioning in a more prin- 
cipled way. We propose to achieve this goal by combining a Gaussian mixture model 
with a Bayesian relevance determination principle. Concerning computational problems 
involved with selecting “relevant” features, a Bayesian inference mechanism makes it 
possible to overcome the combinatorial explosion of the search space which consists of 
all subsets of features. As a consequence, we are able to derive an efficient optimization 
algorithm. The method presented here extends our previous work on combining cluster- 
ing and feature selection by making it applicable to multi-segment problems, whereas 
the algorithms described in [13,12] were limited to the two-segment case.

Our segmentation approach involves two free parameters: the number of mixture 
components and a certain constraint value which determines the average number of 
selected features. In order to find reasonable settings for both parameters, we devise 
a resampling-based stability model selection strategy. Our method follows largely the 
ideas proposed in [8] where a general framework for estimating the number of clusters 
in unsupervised grouping scenarios is described. It extends this concept, however, in one 
important aspect: not only the model order (i.e. the number of segments) but also the
model complexity for a fixed model order (measured in terms of the number of selected 
features) is selected by observing the stability of segmentations under resampling. 



2 Image Segmentation by Mixture Models 



As depicted in figure 1, we start with extracting a set of $N$ image-sites, each of which is
described by a stacked feature vector $x_i \in \mathbb{R}^d$ with $d$ components. The stacked
vector usually contains features from different cues, like color histograms and texture
responses from Gabor filters [10]. For assigning the sites to classes, we use a Gaussian
mixture model with $K$ mixture components sharing an identical covariance matrix $\Sigma$.
Under this model, the data log-likelihood reads

$$l_{\mathrm{mix}} = \sum_{i=1}^{N} \log \Big( \sum_{\nu=1}^{K} \pi_\nu \, \phi(x_i; \mu_\nu, \Sigma) \Big), \qquad (1)$$

where the mixing proportions $\pi_\nu$ sum to one, and $\phi$ denotes a Gaussian density. It is
well-known that the classical expectation-maximization (EM) algorithm [3] provides
a convenient method for finding both the component-membership probabilities and the
model parameters (i.e. means and covariance) which maximize $l_{\mathrm{mix}}$. Once we have
trained the mixture model (which represents a parametric density on $\mathbb{R}^d$) we can easily
predict the component-membership probabilities of sites different from those contained
in the training set by computing Mahalanobis distances to the mean vectors.
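
The prediction step can be sketched as follows. Because all components share one covariance matrix, the Gaussian normalizing constants cancel and the membership probabilities depend only on the mixing proportions and the Mahalanobis distances. The sketch assumes that the inverse covariance matrix is already available; the function names and the numbers are illustrative.

```python
import math

def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance of x to mean under a shared covariance."""
    diff = [a - b for a, b in zip(x, mean)]
    dim = len(diff)
    return sum(diff[i] * cov_inv[i][j] * diff[j]
               for i in range(dim) for j in range(dim))

def membership_probabilities(x, means, weights, cov_inv):
    """Posterior component probabilities for a new site x."""
    log_scores = [math.log(w) - 0.5 * mahalanobis_sq(x, m, cov_inv)
                  for m, w in zip(means, weights)]
    top = max(log_scores)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

means = [(0.0, 0.0), (4.0, 0.0)]      # made-up component means
weights = [0.5, 0.5]                  # mixing proportions
cov_inv = [[1.0, 0.0], [0.0, 1.0]]    # inverse of the shared covariance
halfway = membership_probabilities((2.0, 0.0), means, weights, cov_inv)
at_first_mean = membership_probabilities((0.0, 0.0), means, weights, cov_inv)
```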




Fig. 1. Image-sites and stacked feature vectors (schematically): histogram features
+ avg. Gabor coefficients = stacked feature vector.



2.1 Gaussian Mixtures and Bayesian Relevance Determination 

In order to incorporate the feature selection mechanism into the Gaussian mixture model, 
the M-step of the EM-algorithm undergoes several reformulations. Following [5], the 
M-step can be carried out by linear discriminant analysis (LDA) which uses the “fuzzy 
labels” estimated in the preceding E-step. LDA is equivalent to an optimal scoring 
problem (cf. [6]), the basic ingredient of which is a linear regression procedure against 
the class-indicator variables. Since space here precludes a more detailed discussion of 
the equivalence of the classical M-step and indicator regression, we refer the interested 
reader to the above references and we will concentrate in the following on the aspect of 
incorporating the feature selection method into the regression formalism. 

A central ingredient of optimal scoring is the “blurred” response matrix $Z$, whose
rows consist of the current membership probabilities. Given an initial $K \times (K-1)$
scoring matrix $\Theta$, a sequence of $K-1$ linear regression problems of the form

$$\text{find } \theta_j, \beta_j \text{ which minimize } \|Z\theta_j - X\beta_j\|_2^2, \qquad j = 1, \ldots, K-1 \qquad (2)$$

is solved. X is the data matrix which contains the stacked feature vectors as rows. We 
incorporate the feature selection mechanism into the regression problems by specifying 
a prior distribution over the regression coefficients $\beta$. This distribution has the form of
an Automatic Relevance Determination (ARD) prior: $p(\beta \mid \vartheta) \propto \exp[-\ldots]$.
For each regression coefficient, the ARD prior contains a free hyperparameter $\vartheta_i$, which
encodes the “relevance” of the $i$-th variable in the linear regression. Instead of explicitly
selecting these relevance parameters, which would necessarily involve a search over
all possible subsets of features, we follow the Bayesian view of [4] which consists of
“averaging” over all possible parameter settings: given exponential hyperpriors,
$p(\vartheta_i) = \ldots \exp\{-\ldots\}$, one can analytically integrate out the relevance
parameters from the prior distribution over the coefficients. Switching to the maximum a
posteriori (MAP) solution in log-space, this Bayesian marginalization directly leads to the
following $\ell_1$-constrained regression problems:

$$\text{minimize } \|Z\theta_j - X\beta_j\|_2^2 \quad \text{subject to } \|\beta_j\|_1 < \kappa, \qquad j = 1, \ldots, K-1, \qquad (3)$$

where $\|\beta_j\|_1$ denotes the $\ell_1$ norm of the vector of regression coefficients in the $j$-th
regression problem. This model is known as the LASSO, see [14]. A highly efficient
algorithm for optimizing the LASSO model can be found in [11].
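
To illustrate how such a regression sets irrelevant coefficients exactly to zero, the sketch below solves the penalized (Lagrangian) counterpart of the constrained problem, i.e. it minimizes one half of the squared residual norm plus `lam` times the $\ell_1$ norm, by cyclic coordinate descent with soft-thresholding. This is a swapped-in textbook technique and not the algorithm of [11]; the response vector `y` plays the role of one column $Z\theta_j$, and all names and numbers are illustrative.

```python
def soft_threshold(value, lam):
    """Shrink value towards zero by lam, clipping at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, num_sweeps=200):
    """Minimize 0.5 * ||y - X beta||^2 + lam * ||beta||_1 coordinate-wise."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    residual = list(y)  # equals y - X beta for the initial beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    for _ in range(num_sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) + col_sq[j] * beta[j]
            new_value = soft_threshold(rho, lam) / col_sq[j]
            delta = new_value - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_value
    return beta

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # two orthogonal features
y = [2.0, 0.1, 2.0, 0.1]  # the second feature barely explains the response
beta = lasso_coordinate_descent(X, y, lam=0.5)
```

In this toy problem the weakly informative second feature receives a coefficient of exactly zero, while the first one is shrunk but kept.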

According to [5], in the optimal scoring problem the regression fits are followed by
finding a sequence of optimal orthogonal scores $\Theta$ which maximize
$\mathrm{trace}\{\Theta^T Z^T X B\}$, where the matrix $B$ contains the optimal vectors
$\beta_1, \ldots, \beta_{K-1}$ as columns. In the unconstrained case described in [5], this
maximization amounts to finding the $K-1$ largest eigenvectors $v_i$ of the symmetric matrix
$M = \Theta^T Z^T X B$. The matrix $B$ is then updated as $B \leftarrow BV$. In our case with
active $\ell_1$ constraint, the matrix $M$ is no longer guaranteed to be symmetric. Maximization
of the symmetrized problem $M_{\mathrm{sym}} = \tfrac{1}{2} \cdot M^T M$, however, may be viewed as
a natural generalization. We thus propose to find the optimal scores by an
eigen-decomposition of $M_{\mathrm{sym}}$.

Summing up. For feature selection, we ideally would like to estimate the value of 
a binary selection variable: $S_i$ equals one if the $i$-th feature is considered relevant for
the given task, and zero otherwise. Taking into account feature correlations, however, 
estimation of S involves searching the space of all possible subsets of features. In the 
Bayesian ARD formalism, this combinatorial explosion of the search space is overcome 
by relaxing the binary selection variable to a real-valued relevance parameter. Following a 
Bayesian inference principle, we introduce hyper-priors and integrate out these relevance 
parameters, and we finally arrive at a sequence of $\ell_1$-constrained LASSO problems,
followed by an eigen-decomposition to find the optimal scoring vectors. It is of particular 
importance that this method combines the issues of grouping and feature selection in a 
principled way: both goals are achieved simultaneously by optimizing the same objective 
function, which is simply the constrained data log-likelihood. 

3 Model Selection and Experimental Evaluation 

Our model has two free parameters, namely the number of mixture components and the 
value of the $\ell_1$-constraint $\kappa$. Selecting the number of mixture components is referred
to as the model order selection problem, whereas selecting the number of features can 
be viewed as the problem of choosing the complexity of the model. We now describe a 
method for selecting both parameters by observing the stability of segmentations. 







Selecting the model complexity. We will usually find many potential splits of the data 
into clusters, depending on how many features are selected: if we select only one feature, 
it is likely to find many competing hypotheses for splits, since most of the feature vectors 
vote for a different partition. Taking into account the problem of noisy measurements, the 
finally chosen partition will probably tell us more about the exact noise realization than 
about meaningful splits. If, on the other hand, we select too many features, many of them 
will be irrelevant for the actual task, and with high probability, the EM-algorithm will 
find suboptimal solutions. Between these two extremes, we can hope to find relatively 
stable splits, which are robust against noise and also against inherent instabilities of 
the optimization method. For a fixed model order, we use the following algorithm for 
assessing the value of $\kappa$:

1. Sampling: randomly draw 100 datasets (i.e. sets of sites), each of which contains
$N$ sites. For each site extract the stacked feature vector.

2. Stability analysis: for different constraint values $\kappa$ repeat:

a) Clustering: For each set of sites, train a mixture model with $K$ modes. Assign
each of the sites in the $i$-th set to one of $K$ groups, based on the estimated
membership probabilities. Store the labels $l_i$ and the model parameters $p_i$.

b) For each pair $(i, j)$, $i \neq j$, of site sets do

Prediction: use the $i$-th mixture model (we have stored all parameters in $p_i$) to
predict the labels of the $j$-th sample. Denote these labels by $l_j^i$;

Distance calculation: calculate the permutation-corrected Hamming distance
between original and predicted labels by minimizing over all permutations $\pi$:

$d_{i,j}^{\mathrm{Hamming}} = \min_\pi \sum_k \left(1 - \delta\{l_j(k), \pi(l_j^i(k))\}\right)$, (4)

($\delta$ denotes the Kronecker symbol), and store it in the $(100 \times 100)$ matrix $D$.
The minimization over all permutations can be done efficiently by using the
Hungarian method for bipartite matching with time complexity $O(K^3)$ [7].

c) Partition clustering & prototype extraction: use Ward's agglomerative
method to cluster the matrix $D$. Stop merging partition-clusters if the average
within-cluster Hamming distance exceeds a threshold $\epsilon = \gamma \cdot (1 - 1/K)$
proportional to the expected distance in a random setting (for random labellings
we expect an average distance of $(1 - 1/K)$). In the experiments we have chosen
$\gamma = 0.05 = 5\%$. In each partition-cluster, select the partition which is nearest
to the cluster centroid as the prototypical partition.
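
The distance calculation in step 2b can be sketched as follows. This is a minimal illustration, not the authors' implementation: it replaces the Hungarian method by a brute-force search over all $K!$ label permutations (feasible only for small $K$), and it divides the mismatch count by the number of sites so that the value is directly comparable to the expected random distance $(1 - 1/K)$. The function name and this normalization are assumptions.

```python
from itertools import permutations


def permutation_corrected_hamming(labels_a, labels_b, n_clusters):
    """Fraction of sites whose labels disagree under the best relabelling of labels_b.

    Labels are integers in range(n_clusters). Brute force over all permutations;
    a sketch only, the Hungarian method scales far better for larger K.
    """
    n_sites = len(labels_a)
    best = n_sites
    for perm in permutations(range(n_clusters)):
        mismatches = sum(1 for a, b in zip(labels_a, labels_b) if a != perm[b])
        best = min(best, mismatches)
    return best / n_sites


# Two labellings that describe the same partition under renamed clusters.
d_same = permutation_corrected_hamming([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0], 3)
# One site out of four is assigned differently.
d_diff = permutation_corrected_hamming([0, 0, 1, 1], [1, 1, 0, 1], 2)
```

Because cluster indices are arbitrary in unsupervised grouping, the minimization over permutations is what makes two partitions comparable at all.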

Selecting the model order. In order to select a suitable number K of mixture com- 
ponents, we repeat the whole complexity selection process for different values of K. 
We consider that $K$-value as the most plausible one for which the percentage of partitions
in the individual partition clusters attains a maximum. Since in most unsupervised
grouping problems there is more than one “interesting” interpretation of the data, we
might, however, gain further insights by also studying other $K$-values with high but not
maximal stability, see figure 4 for an example.

Figures 2 and 3 show the results of the model selection process for an artificial image 
with five segments. Two of the segments are solely defined in terms of different grey 
value distributions without any texture information. Two other segments, on the other 
hand, contain the same texture pattern in different orientations, which makes them
indistinguishable in terms of grey values. In order to capture both types of information, at
each site we stacked 12 grey value histogram bins and 16 Gabor coefficients on different 
scales and orientations into a 28-dimensional feature vector. The features are normalized 
to zero mean and unit variance across the randomly chosen set of image-sites. The right 
panel of figure 2 depicts the outcome of the model-order selection process. The stability 
curve shows a distinct maximum for 5 mixture components. 83% of all partitions found 
in 100 resampling experiments are extremely similar: their average divergence is less 
than 5% of the expected divergence in a random setting. 

Figure 3 gives more insight into the model-complexity selection process for this most
stable number of mixture components. For small values of the $\ell_1$ constraint $\kappa$ only very
few features are selected, which leads to highly fluctuating segmentations. This observation
is in accordance with our expectation that the selection of only a few single features
would be highly sensitive to the sampling noise. The full model containing all features
also turns out to be rather unstable, probably due to the irrelevance of most feature dimensions.
For the task of separating e.g. the two segments which contain the same texture in
different orientations, all color features are basically uninformative noise dimensions.
Between these two extremes, however, we find a highly stable segmentation result. On
average, 13 features are automatically selected. More important than this average number,
however, is the fact that in each of the 4 regression fits (we have $K = 5$ mixture
components and thus $K - 1 = 4$ fits) the features are selected in an adaptive fashion:
in one of the regression problems almost exclusively grey-value features are selected,
whereas two other regression fits mainly extract texture information. By combining the
4 regression fits the model is able to extract both types of information while successfully
suppressing the irrelevant noise content.




Fig. 2. Model-order selection by resampling: stability of segmentations (measured in terms of
percentage of highly similar partitions) vs. number of mixture components. Right: input image.



Fig. 3. Selecting the model-complexity for fixed number of mixture components
$K = 5$. Solid curve: stability vs. $\ell_1$ constraint $\kappa$. Dashed curve: number of
selected features.

Real world examples. We applied our method to several images from the Corel
database. Figure 4 shows the outcome of the whole model selection process for an
image taken from the Corel “shell-textures” category, see figure 5. The stability curve for
assessing the correct model order favors the use of two mixture components. In this case,
the most stable partitions are obtained for a highly constrained model which employs on
average only 2 features (left panel). A closer look at the partition clusters shows that there
is a bimodal distribution of cluster populations: 44 partitions found in 100 resampling
experiments form a cluster that segments out the textured shell from the unstructured
environment (only texture features are selected in this case), whereas in 37 partitions
only color features are extracted, leading to a bipartition of the image into shadow and
foreground.




Fig. 4. A shell image from the Corel database: model selection by resampling.



Both possible interpretations of the image are combined in the three-component
model depicted in the right panel. The image is segmented into three classes that correspond
to “shell”, “coral” and “shadow”. The most stable three-component model uses
a combination of five texture and three color features. This example demonstrates that,
due to the unsupervised nature of the segmentation problem, there is sometimes more
than one “plausible” solution. Our feature selection process is capable of exploring
such ambiguities, since it provides the user not only with a single optimal model but
with a ranked list of possible segmentations. The reader should notice that also in this
example the restriction of the model complexity enforced by the $\ell_1$ constraint is crucial
for obtaining stable segmentations. We applied our method to several other images from
the Corel database, but due to space limitations we refer the interested reader to our
web-page www.inf.ethz.ch/~vroth/segments_dagm.html.

Fig. 5. The shell image and the three-component segmentation solution.

4 Discussion 

In image segmentation, one often faces the problem that relevant information is spread 
over different cues like color and texture. And even within one cue, different scales 
might be relevant for segmenting out certain segments. The question of how to combine 
such different types of features in an optimal fashion is still an open problem. We 
present a method which overcomes many shortcomings of “naively” stacking all features 
into a combined high-dimensional vector which then enters a clustering procedure. 
The main ingredient of the approach is an automatic feature selection mechanism for 
distinguishing between “relevant” and “irrelevant” features. Both the process of grouping 
sites to segments and the process of selecting relevant information are subsumed under 
a common likelihood framework which allows the algorithm to select features in an 
adaptive task-specific way. This adaptiveness property makes it possible to combine the 
relevant information from different cues while successfully suppressing the irrelevant 
noise content. Examples for both synthetic and natural images effectively demonstrate 
the strength of this approach. 



Acknowledgments. The authors would like to thank Joachim M. Buhmann for helpful 
discussions and suggestions. 
Abstract. The use of non-orthonormal basis functions in ridge regression
leads to an often undesired non-isotropic prior in function space.
In this study, we investigate an alternative regularization technique that 
results in an implicit whitening of the basis functions by penalizing di- 
rections in function space with a large prior variance. The regularization 
term is computed from unlabelled input data that characterizes the in- 
put distribution. Tests on two datasets using polynomial basis functions 
showed an improved average performance compared to standard ridge 
regression. 



1 Introduction 

Consider the following situation: We are given a set of $N$ input values $x_i \in \mathbb{R}^m$
and the corresponding $N$ measurements of the scalar output values $t_i$. Our task
is to model the output by linear combinations from a dictionary of fixed functions
$\varphi_j$ of the input $x$, i.e.,

$y_i = \sum_{j=1}^{M} \gamma_j \varphi_j(x_i)$, or more conveniently, $y_i = \gamma^T \phi(x_i)$, (1)

using $\phi(x_i) = (\varphi_1(x_i), \varphi_2(x_i), \dots)^T$. The number of functions $M$ in the dictionary
can be possibly infinite as for instance in a Fourier or wavelet expansion.
Often, the functions contained in the dictionary are neither normalized nor orthogonal
with respect to the input. This situation is common in kernel ridge
regression with polynomial kernels. Unfortunately, the use of a non-orthonormal
dictionary in conjunction with the ridge regularizer $\|\gamma\|^2$ often leads to an undesired
behaviour of the regression solutions since the constraints imposed by
this choice rarely happen to reflect the - usually unknown - prior probabilities 
of the regression problem at hand. This can result in a reduced generalization 
performance of the solutions found. 

In this study, we propose an approach that can alleviate this problem either 
when unlabelled input data is available, or when reasonable assumptions can be 
made about the input distribution. From this information, we compute a regularized
solution of the regression problem that leads to an implicit whitening of the
function dictionary. Using examples from polynomial regression, we investigate 
whether whitened regression results in an improved generalisation performance. 

2 Non-orthonormal Functions and Priors in Function 
Space 

The use of a non-orthonormal function dictionary in ridge regression leads to 
a non-isotropic prior in function space. This can be seen in a simple toy example
where the function to be regressed is of the form $t_i = \sin(a x_i)/(a x_i) + n_i$,
with the input $x_i$ uniformly distributed in $[-1, 1]$ and an additive Gaussian noise
signal $n_i \sim \mathcal{N}(0, \sigma_\nu^2)$. Our function dictionary consists of the first six canonical
polynomials $\varphi_1(x) = 1$, $\varphi_2(x) = x$, $\varphi_3(x) = x^2$, \dots, $\varphi_6(x) = x^5$, which are neither
orthogonal nor normalized with respect to the uniform input. The effects on the
type of functions that can be generated by this choice of dictionary can be seen
in a simple experiment: we assume that the weights in Eq. 1 are distributed
according to an isotropic Gaussian, i.e., $\gamma \sim \mathcal{N}(0, \sigma^2 I_6)$, such that no function
in the dictionary receives a higher a priori weight. Together with Eq. 1, these
assumptions define a prior distribution over the functions $y(x)$ that can be generated
by our dictionary. In our first experiment, we draw samples from this
distribution (Fig. 1a) and compute the mean square of $y(x)$ at all $x \in [-1, 1]$ for
1000 functions generated by the dictionary (Fig. 1b). It is immediately evident
that, given a uniform input, our prior narrowly constrains the possible solutions
around the origin while admitting a broad variety near the interval boundaries.
If we do ridge regression with this dictionary (here we used a Gaussian Process
regression scheme, for details see [5]), the solutions tend to show a similar behaviour
as long as they are not sufficiently constrained by the data points (see the
diverging solution at the left interval boundary in Fig. 1c). This can lead to bad
predictions in sparsely populated areas.

If we choose a dictionary of orthonormal polynomials instead (in our example
these are the first six Legendre polynomials), we observe a different behaviour:
the functions sampled from the prior show a richer structure (Fig. 1d)
with a relatively flat mean square value over the interval $[-1, 1]$ (Fig. 1e). As a
consequence, the ridge regression solution usually does not diverge in sparsely
populated regions near the interval boundaries (Fig. 1f).
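
The shape of this prior can also be checked in closed form. Assuming independent zero-mean weights with a common variance $\sigma^2$ (the isotropic Gaussian of the toy experiment), the mean square of $y(x) = \sum_j \gamma_j \varphi_j(x)$ is $E[y(x)^2] = \sigma^2 \sum_j \varphi_j(x)^2$, which for the canonical polynomials equals $\sigma^2 \sum_{j=0}^{5} x^{2j}$. The short sketch below evaluates this expression; the function name and the default $\sigma^2 = 1$ are illustrative assumptions.

```python
def prior_mean_square(x, n_basis=6, sigma2=1.0):
    """E[y(x)^2] for y(x) = sum_j gamma_j * x**j with i.i.d. weights of variance sigma2.

    Cross terms vanish because the weights are independent with zero mean,
    so only the squared basis functions x**(2j) contribute.
    """
    return sigma2 * sum((x ** j) ** 2 for j in range(n_basis))


# Prior variance at the origin versus the interval boundary.
at_origin = prior_mean_square(0.0)    # only the constant function contributes
at_boundary = prior_mean_square(1.0)  # all six monomials contribute equally
```

At the origin only the constant term contributes, whereas at $x = \pm 1$ all six monomials contribute equally, so the prior variance is six times larger at the boundaries.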

The reason for this behaviour can be seen if one thinks of the functions 
as points in a function space. The dictionary defines a basis in a subspace such 
that all possible solutions of the form Eq. 1 are linear combinations of these basis 
vectors. Assuming an isotropic distribution of the weights, a non-orthogonal basis 
results in a non-isotropic distribution of points in function space. As a result, 
any new function to be expressed (or regressed) in this basis will have a larger 
probability if its projection onto the basis happens to be along a larger principal 
component, i.e., we have a non-isotropic prior in function space. Conversely, an 
orthonormal basis in conjunction with an isotropic weight distribution results in
an isotropic prior in function space such that no specific function is preferred over
another. This situation may often be preferable if nothing is known in advance
about the function to be regressed.

Fig. 1. Toy experiment using polynomial bases: a. 10 examples drawn from the prior
in function space generated by the first 6 canonical polynomials and b. generated by
the first 6 Legendre polynomials; c. Mean squared value in the interval $[-1, 1]$ of 1000
random linear combinations of the first 6 canonical polynomials and d., of the first 6
Legendre polynomials; e. Regression on 10 training samples (stars) using the canonical
polynomial basis and f., the Legendre basis. The dashed line denotes the true function,
the solid line the prediction from regression. The shaded areas show the 95%-confidence
intervals.

3 Whitened Regression 

The standard solution to regression is to find the weight vector $\gamma$ in Eq. 1 that
minimizes the sum of the squared errors. If we put all $\phi(x_i)$ into an $N \times M$
design matrix $\Phi$ with $\Phi = (\phi(x_1)^T, \phi(x_2)^T, \dots, \phi(x_N)^T)^T$, the model (1) can be
written as $y = \Phi\gamma$ such that the regression problem can be formulated as

$\mathrm{argmin}_{\gamma} \, (t - \Phi\gamma)^2$. (2)

The problem with this approach is that if the noise terms $n_i$ are large, then forcing
$y$ to fit as closely as possible to the data results in an estimate that models the
noise as well as the function to be regressed. A standard approach to remedy
such problems is known as the method of regularization in which the square
error criterion is augmented with a penalizing functional

$(t - \Phi\gamma)^2 + \lambda J(\gamma)$, $\lambda > 0$. (3)

The penalizing functional $J$ is chosen to reflect prior information that may be
available regarding $\gamma$; $\lambda$ controls the tradeoff between fidelity to the data and the
penalty $J(\gamma)$. In many applications, the penalizing functional can be expressed
as a quadratic form

$J(\gamma) = \gamma^T \Sigma_\gamma^{-1} \gamma$ (4)

with a positive definite matrix $\Sigma_\gamma$. The solution of the regression problem can
be found analytically by setting the derivative of expression (3) with respect to
$\gamma$ to zero and solving for $\gamma$:

$\gamma_{\mathrm{opt}} = (\lambda \Sigma_\gamma^{-1} + \Phi^T \Phi)^{-1} \Phi^T t$. (5)

Based on $\gamma_{\mathrm{opt}}$, we can predict the output for the new input $x^*$ using

$y^* = \gamma_{\mathrm{opt}}^T \phi(x^*) = t^T \Phi (\lambda \Sigma_\gamma^{-1} + \Phi^T \Phi)^{-1} \phi(x^*)$. (6)

Note that the solution depends only on products between basis functions evaluated
at the training and test points. For certain function classes, these can
be efficiently computed using kernels (see next section). In ridge regression, an
isotropic penalty term on $\gamma$ corresponding to $\Sigma_\gamma = \sigma^2 I_M$ is chosen. This can
lead to a non-isotropic prior in function space as we have seen in the last section
for non-orthonormal function dictionaries.

What happens if we transform our basis such that it becomes orthonormal?
The proper transformation can be found if we know the covariance matrix $C_\phi$
of our basis with respect to the distribution of $x$,

$C_\phi = E_x[\phi(x)\phi(x)^T]$, (7)

where $E_x$ denotes the expectation with respect to $x$. The whitening transform
is defined as

$D = D^T = C_\phi^{-1/2}$. (8)

The transformed basis $\tilde\phi = D\phi$ has an isotropic covariance as desired:

$C_{\tilde\phi} = E_x[\tilde\phi(x)\tilde\phi(x)^T] = E_x[D\phi(x)\phi(x)^T D^T] = D E_x[\phi(x)\phi(x)^T] D^T = I_M$. (9)

Often, however, the matrix $C_\phi$ will not have full rank such that a true whitening
transform cannot be found. In these cases, we propose to use a transform of the
form

$D = (C_\phi + I_M)^{-1/2}$. (10)

This choice prevents the amplification of possibly noise-contaminated eigenvectors
of $C_\phi$ with small eigenvalues (since the minimal eigenvalue of $(C_\phi + I_M)$
is 1) while still leading to a whitening effect for eigenvectors with large enough
eigenvalues.
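
For the canonical polynomial dictionary of the toy example, the covariance matrix of Eq. (7) can be written down exactly when the input is uniform on $[-1, 1]$: its entries are the moments $E[x^{i+j}]$, which equal $1/(i+j+1)$ for even $i+j$ and $0$ otherwise. The following sketch builds this matrix with exact fractions; the function name is hypothetical and the uniform input distribution is the assumption taken from the toy example.

```python
from fractions import Fraction


def monomial_covariance(n_basis=6):
    """C_phi = E[phi(x) phi(x)^T] for phi_j(x) = x**j, x uniform on [-1, 1]."""

    def moment(p):
        # E[x^p] = (1/2) * integral of x^p over [-1, 1]
        return Fraction(1, p + 1) if p % 2 == 0 else Fraction(0)

    return [[moment(i + j) for j in range(n_basis)] for i in range(n_basis)]


C = monomial_covariance()
# Diagonal entries differ from 1 (not normalized) and entries such as
# C[0][2] are non-zero (not orthogonal), so C is far from the identity.
diagonal = [C[i][i] for i in range(6)]
```

The diagonal decays as $1, 1/3, 1/5, \dots$ and even-even and odd-odd pairs are correlated, which is the non-orthonormality that the whitening transform is meant to remove.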

When we substitute the transformed basis $\tilde\phi = D\phi$ into Eq. (5) using $\Sigma_\gamma = I_M$,
we obtain

$\tilde\gamma_{\mathrm{opt}} = D^{-1} (\lambda (D^{-1})^2 + \Phi^T \Phi)^{-1} \Phi^T t$. (11)

The prediction equation (6) is accordingly

$y^* = t^T \Phi (\lambda (D^{-1})^2 + \Phi^T \Phi)^{-1} \phi(x^*)$. (12)

It follows that doing standard ridge regression with a whitened, orthonormal basis
is equivalent to doing regression in the original, non-orthonormal basis using
the regularizer $J(\gamma) = \gamma^T D^{-2} \gamma$. This allows us to use an implicitly whitened
basis without the need to change the basis functions themselves. This is particularly
useful when we do not have the freedom to choose our basis as, for instance,
in kernel-based methods where the basis functions are determined by the kernel
(see next section).
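
A minimal sketch of this equivalence in the original basis: with the transform of Eq. (10), $D^{-2} = C_\phi + I_M$, so the prediction of Eq. (12) only requires solving the linear system $(\lambda (C_\phi + I_M) + \Phi^T \Phi) w = \Phi^T t$ and no matrix square root. The helper names, the plain Gaussian elimination and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def whitened_ridge_weights(Phi, t, C_phi, lam):
    """Weights w with y* = w^T phi(x*), using the regularizer D^{-2} = C_phi + I."""
    n, m = len(Phi), len(Phi[0])
    A = [[lam * (C_phi[i][j] + (1.0 if i == j else 0.0))
          + sum(Phi[r][i] * Phi[r][j] for r in range(n))
          for j in range(m)] for i in range(m)]
    b = [sum(Phi[r][i] * t[r] for r in range(n)) for i in range(m)]
    return solve_linear(A, b)


def predict(weights, phi_x):
    """Evaluate the regression function at a point with basis values phi_x."""
    return sum(w * p for w, p in zip(weights, phi_x))


# Toy usage: two basis functions, the second has the larger prior variance
# and is therefore penalized more strongly.
w = whitened_ridge_weights([[1.0, 0.0], [0.0, 1.0]], [2.0, 4.0],
                           [[1.0, 0.0], [0.0, 3.0]], lam=1.0)
```

Setting `C_phi` to the zero matrix recovers standard ridge regression with penalty $\lambda \|\gamma\|^2$.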

The proposed approach, however, suffers from a certain drawback because
we need to know $C_\phi$. In certain cases, the input distribution is known or can
be approximated by reasonable assumptions such that $C_\phi$ can be computed
beforehand. In other cases, unlabelled data is available which can be used to
estimate $C_\phi$. The training data itself, however, cannot be used to estimate $C_\phi$
since the estimate is proportional to $\Phi^T \Phi$. When substituted into Eq. (12) this
amounts to no regularization at all. As a consequence, for the proposed approach
to work it is absolutely necessary to obtain $C_\phi$ from data independent of the
training data.

4 Whitened Kernel Regression 

When the number of basis functions is large, a direct solution to the regression 
problem as described in the previous section becomes infeasible. Fortunately, 
there is a work-around to this problem for many important function classes:
We noted in the previous section that the regression solution depends only on
products between basis functions evaluated at the training and test points. For
certain function dictionaries, the product between the functions evaluated at two
input values $x_1$ and $x_2$ can be expressed as

$\phi(x_1)^T \phi(x_2) = k(x_1, x_2)$. (13)

The function $k(x_1, x_2)$ on $\mathbb{R}^m \times \mathbb{R}^m$ is a positive definite kernel (for a definition
see [3]). As a consequence, the evaluation of a possibly infinite number of terms
in $\phi(x_1)^T \phi(x_2)$ reduces to the computation of the kernel $k$ directly on the input.
Equation (13) is only valid for positive definite kernels, i.e., functions $k$ with
the property that the Gram matrix $K_{ij} = k(x_i, x_j)$ is positive definite for all
choices of the $x_1, \dots, x_N$. It can be shown that a number of kernels satisfy this
condition, including polynomial and Gaussian kernels [3].

A kernelized version of whitened regression is obtained by considering the
set of $N$ basis functions which is formed by the Kernel PCA Map [3]

$\phi(x) = K^{-1/2} (k(x_1, x), k(x_2, x), \dots, k(x_N, x))^T$. (14)

The subspace spanned by the $\phi(x_i)$ has the structure of a reproducing kernel
Hilbert space (RKHS). By carrying out linear methods in an RKHS, one can
obtain elegant solutions for various nonlinear estimation problems [3], examples
being Support Vector Machines. When we substitute this basis in Eq. (5), we
obtain

$\gamma_{\mathrm{opt}} = (\lambda \Sigma_\gamma^{-1} + K)^{-1} K^{1/2} t$ (15)

using the fact that $\Phi = K^{-1/2} K = K^{1/2} = \Phi^T$. By setting $k(x) = (k(x_1, x),
k(x_2, x), \dots, k(x_N, x))^T$, the prediction (6) becomes

$y^* = t^T (\lambda K^{1/2} \Sigma_\gamma^{-1} K^{-1/2} + K)^{-1} k(x^*)$. (16)

It can be easily shown that this solution is exactly equivalent to Eq. 6 if Eq. 13
holds. When choosing $\Sigma_\gamma = I_N$, one obtains the solution of standard kernel
ridge regression [1]. Application of the whitening prior leads to

$y^* = t^T (\lambda R + K)^{-1} k(x^*)$. (17)

Here, $C_\phi = K^{-1/2} C_k K^{-1/2}$ and $C_k = E_x[k(x) k(x)^T]$. This results in $R = C_k K^{-1}$
or $R = C_k K^{-1} + I_N$, depending on the choice of $D$.



5 Experiments 

Table 1. Average squared error for whitened and ridge regression. Significant p-values
< 0.1 are marked by a star.

Kernel                 Summed        Adaptive      Inhomogeneous
                       Polynomial    Polynomial    Polynomial

Sine dataset
Ridge regression       1.126         1.578         0.863
Whitened regression    0.886         0.592         0.787
p-value (t-test)       0.471         0.202         0.064*

Boston house-price
Ridge regression       18.99         16.37         18.74
Whitened regression    12.83         15.78         13.08
p-value (t-test)       0.022*        0.817         0.053*

We compare whitened regression to ridge regression [1] using the kernelized form
of Eq. 17 with $R = C_k K^{-1} + I_N$ and Eq. 16 with $\Sigma_\gamma = I_N$, respectively. We
consider three types of polynomial kernels that differ in the weights assigned to
the different polynomial orders: the summed polynomial kernel

$k_{sp}(x_1, x_2) = \sum_{i=0}^{d} (x_1^T x_2)^i$, (18)

the adaptive polynomial kernel

$k_{ap}(x_1, x_2) = \sum_{i=0}^{d} a_i (x_1^T x_2)^i$, (19)

where the weights $a_i$ are hyperparameters adapted during the learning process,
and the inhomogeneous polynomial kernel

$k_{ihp}(x_1, x_2) = (1 + x_1^T x_2)^d = \sum_{i=0}^{d} \binom{d}{i} (x_1^T x_2)^i$. (20)
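
The three kernels can be sketched as weighted sums over polynomial orders. The sketch assumes that the summed kernel is the unweighted sum of $(x_1^T x_2)^i$ for $i = 0, \dots, d$, in analogy to the binomial expansion of the inhomogeneous kernel; all function names are hypothetical.

```python
from math import comb


def dot(x1, x2):
    """Inner product of two input vectors given as lists."""
    return sum(a * b for a, b in zip(x1, x2))


def adaptive_poly_kernel(x1, x2, weights):
    """Sum of weights[i] * (x1 . x2)**i; the weights are free hyperparameters."""
    s = dot(x1, x2)
    return sum(a * s ** i for i, a in enumerate(weights))


def summed_poly_kernel(x1, x2, d):
    """All polynomial orders 0..d enter with weight one (assumed form)."""
    return adaptive_poly_kernel(x1, x2, [1.0] * (d + 1))


def inhomogeneous_poly_kernel(x1, x2, d):
    """(1 + x1 . x2)**d, i.e. binomial weights on the polynomial orders."""
    return (1.0 + dot(x1, x2)) ** d


def inhomogeneous_as_weighted_sum(x1, x2, d):
    """The same kernel written as a weighted sum with weights C(d, i)."""
    return adaptive_poly_kernel(x1, x2, [comb(d, i) for i in range(d + 1)])


x1, x2 = [1.0, 2.0], [0.5, -1.0]
k_closed = inhomogeneous_poly_kernel(x1, x2, 3)
k_expanded = inhomogeneous_as_weighted_sum(x1, x2, 3)
```

Both the summed and the inhomogeneous kernel are special cases of the adaptive kernel with fixed weights, which is why they differ only in how strongly each polynomial order is regularized.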

In both experiments, we used a 10-fold cross-validation setup with disjoint test
sets. For each of the 10 partitions and the different kernels, we computed the 
squared error loss. In addition to the average squared loss, we tested the signif- 
icance of the performance difference between whitened and standard regression 
using a t-test on the squared loss values. 

1. Sine dataset. Our first experiment is the $\sin(ax)/(ax)$ toy example ($a = 8$,
noise variance $\sigma_\nu^2 = 0.05$) of Sec. 2 with disjoint training sets of 10 examples and
disjoint test sets of 80 examples. We estimated the covariance $C_k$ for Eq. 17 from
4000 additional unlabelled cases. The hyperparameters $\lambda$ and $a_i$ were estimated
by conjugate gradient descent on the analytically computed leave-one-out error
[4]; the best degree $d$ was also chosen according to the smallest leave-one-out
error for all orders up to 10.

2. Boston Housing. For testing whitened regression on real data, we took dis- 
joint test sets of 50/51 examples and training sets of 455/456 examples from 
the Boston house-price dataset [2]. Note that due to dependencies in the train- 
ing sets, independence assumptions needed for the t-test could be compromised. 
Since the Boston house-price dataset does not provide additional unlabelled 
data, we had to generate 2000 artificial unlabelled datapoints for each of the 10
trials based on the assumption that the input is uniformly distributed between
the minima and maxima of the respective training set. The artificial datapoints
were used to estimate $C_k$. Instead of the leave-one-out error, we used conjugate
gradient descent on a Bayesian criterion for selecting the hyperparameters, usually
referred to as negative log evidence [5]. The maximal degree $d$ tested was 5.
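
The generation of artificial unlabelled data and the estimate of $C_k = E_x[k(x) k(x)^T]$ can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the fixed random seed and the toy inputs are hypothetical, and the expectation is replaced by a plain sample average over the artificial points.

```python
import random


def sample_uniform_box(train_X, n_samples, seed=0):
    """Draw points uniformly between the per-dimension minima and maxima of train_X."""
    rng = random.Random(seed)
    dims = range(len(train_X[0]))
    lo = [min(x[d] for x in train_X) for d in dims]
    hi = [max(x[d] for x in train_X) for d in dims]
    return [[rng.uniform(lo[d], hi[d]) for d in dims] for _ in range(n_samples)]


def estimate_kernel_covariance(train_X, unlabelled_X, kernel):
    """Sample average of k(x) k(x)^T with k(x) = (k(x_1, x), ..., k(x_N, x))^T."""
    n = len(train_X)
    C = [[0.0] * n for _ in range(n)]
    for x in unlabelled_X:
        k_vec = [kernel(xi, x) for xi in train_X]
        for i in range(n):
            for j in range(n):
                C[i][j] += k_vec[i] * k_vec[j]
    m = len(unlabelled_X)
    return [[c / m for c in row] for row in C]


# Toy usage with a linear kernel on two training inputs.
train = [[0.0, 1.0], [2.0, 3.0]]
unlabelled = sample_uniform_box(train, 2000)
C_k = estimate_kernel_covariance(
    train, unlabelled, lambda a, b: sum(u * v for u, v in zip(a, b)))
```

Since the artificial points are drawn independently of the training targets, the resulting estimate avoids the degenerate case in which the regularizer is proportional to the data term.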

The results in Table 1 show that whitened regression performs on average
better than standard ridge regression. However, the improvement appears to
be relatively small in many cases such that we get a significant result with 
p < 0.1 only for the inhomogeneous polynomial kernel on both datasets and 
for the summed polynomial kernel on the Boston house-price set. The weaker 
significance of the results on the Sine dataset can be attributed to the very small 
number of training samples which leads to a large variance in the results. 

The assumption of a uniformly distributed input in the Boston housing data 
seems to be useful as it leads to a general improvement of the results. The signif- 
icantly better performance for the summed and the inhomogeneous polynomial 
kernel is mainly caused by the fact that often the standard ridge regression 
found only the linear solution with a typical squared error around 25, whereas 
whitened regression always extracted additional structure from the data with 
squared errors between 10 and 16. 



6 Conclusion 

Using a non-orthonormal set of basis functions for regression can result in an
often unwanted prior on the solutions such that an orthonormal or whitened
basis is preferable for this task. We have shown that doing standard regression
in a whitened basis is equivalent to using a special whitening regularizer for the
non-orthonormal function set that can be estimated from unlabelled data.

Our results indicate that whitened regression using polynomial bases leads 
only to small improvements in most cases. In some cases, however, the improve- 
ment is significant, particularly in cases where the standard polynomial regres- 
sion could not find a non-trivial solution. As a consequence, whitened regression 
is always an option to try when unlabelled data is available, or when reasonable 
assumptions can be made about the input distribution. 
Abstract. A fast algorithm is presented for the detection of independently moving objects
by an observer that is itself moving, based on the investigation of optical flow fields.
Since the measurement of optical flow is a computationally expensive operation, it
is necessary to restrict the number of flow measurements. The proposed algorithm
uses two different ways to determine the positions where optical flow is calculated.
A part of the positions is determined using a particle filter, while the other part of the
positions is determined using a random variable, which is distributed according
to an initialization distribution. This approach results in a restricted number of 
optical flow calculations leading to a robust real time detection of independently 
moving objects on standard consumer PCs. 



1 Introduction 

The detection of independently moving objects by an observer that is itself moving is a vital
ability for any animal. The early detection of an enemy while moving through visual clutter
can be a matter of life and death. It is also useful for modern humans, e.g. for collision
prevention in traffic. Using the human head as an inspiration, a lightweight monocular 
camera mounted on a pan-tilt-unit (PTU) is chosen to investigate the environment in 
this application. The analysis of optical flow fields gathered from this camera system is 
a cheap and straightforward approach avoiding heavy and sensitive stereo rigs. Since
determining highly accurate optical flow with subpixel precision is a computationally 
expensive operation, restrictions on the maximum number of optical flow computations 
have to be made in real time environments. The approach chosen in this work is inspired 
by [8] and determines the sample positions (i.e. points where optical flow will be cal- 
culated) partly by using a vector of random variables, which are distributed according 
to an initialization distribution function (IDF), and partly by propagating samples from 
the last time step using a particle filter approach. While a wide range of literature exists on the
application of particle filters to tracking tasks [8,9,12] and, more recently, on improvements to
the particle filter that overcome the degeneracy problem [5,6,10,15], little work
has been done on using such probabilistic techniques for the investigation
and interpretation of optical flow fields: in [2] motion discontinuities are tracked using
optical flow and the CONDENSATION algorithm, and in 2002 [16] used a particle filter
to predict and thereby speed up a correlation-based optical flow algorithm. In the

* This work was supported by BMBF Grant No. 1959156C. 
following sections, the basic concept used for the detection of independent motion is
explained first. The particle filter system used to speed up and stabilize the detection of
independent motion is developed next. Finally, experiments with synthetic and real data
are shown.



2 Detection of Independently Moving Objects 

The basic concepts used for the detection of independently moving objects by a moving 
observer through investigation of the optical flow are introduced in this section. 



Computation of the Optical Flow: A large number of algorithms for the computation
of optical flow exist [1]. Any of these algorithms that calculates the full 2D optical flow can
be used for the proposed algorithm. Algorithms calculating the normal flow only (i.e. the
flow component parallel to the image gradient) are, however, inappropriate.
The optical flow in this work is calculated using an iterative gradient descent algorithm
[11], applied to subsequent levels of a Gaussian image pyramid.




Fig. 1. Theoretical flow fields for a simple scene. The 3D scene is shown at (a). The scene consists 
of 3 blocks. The camera, displayed as small pyramids, translates towards the blocks while rotating 
around the y axis. The flow field F as induced by this movement is shown in (b). Its rotational 
component Fr (c) and translational component Ft (d) with the Focus of Expansion (FOE) are 
shown on the right. 



Detection of Independent Motion: Optical flow fields consist of a rotational part and
a translational part (Fig. 1). The rotational part is independent of the scene geometry
and can be computed from the camera rotation. Subtraction of the rotational flow field
from the overall flow field results in the translational flow field, where all flow vectors
point away from the focus of expansion (FOE), which can be calculated from the camera
motion. With known camera motion, only the direction of the translational part of the
optical flow field can be predicted. The angle between the predicted direction and the
(likewise rotation-corrected) flow calculated from the two images serves as a measure for
independent motion [14] (Fig. 2). This detection method requires exact knowledge
of the camera motion. In our approach, the camera motion can be derived from rotation
sensor and speed sensor data of the car, or it can alternatively be measured directly from
the static scene [13].
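
As an illustration of this angle measure, the following minimal Python sketch computes the cosine of the angle between the predicted and the measured translational flow at one image position. The function name and the tuple-based interface are assumptions made here for illustration. The rotational flow and the FOE are taken as given inputs, since their computation from the camera motion is not spelled out in this section.

```python
import math


def cos_angle_to_prediction(pos, flow, rot_flow, foe):
    """Cosine of the angle between predicted and measured translational flow.

    pos, flow, rot_flow and foe are (x, y) tuples in image coordinates.
    rot_flow is the rotational flow predicted from the camera rotation.
    Returns None if the direction is undefined.
    """
    # Remove the rotational component to obtain the translational flow.
    tx, ty = flow[0] - rot_flow[0], flow[1] - rot_flow[1]
    # Predicted direction: pointing away from the focus of expansion.
    px, py = pos[0] - foe[0], pos[1] - foe[1]
    norm = math.hypot(tx, ty) * math.hypot(px, py)
    if norm == 0.0:
        return None  # zero flow, or the point lies exactly on the FOE
    return (tx * px + ty * py) / norm
```

A value close to 1 means that the measured flow agrees with the prediction for a static scene point, while smaller values indicate a deviation that may be caused by independent motion.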




Fast Monocular Bayesian Detection of Independently Moving Objects 






3 Particle Filter 

First the general concept of the CONDENSATION algorithm is summarized. Then the 
application of a particle filter for detection of independent motion is described. 



3.1 CONDENSATION 



The CONDENSATION algorithm is designed to handle the task of propagating any 
probability density function (pdf) over time. Due to the computational complexity of 
this task, pdfs are approximated by a set of weighted samples. The weight $\pi_n$ is given
by

$$\pi_n = \frac{p_z(s^{(n)})}{\sum_{j=1}^{N} p_z(s^{(j)})} \qquad (1)$$

where $p_z(x) = p(z|x)$ is the conditional observation density representing the probability
of a measurement $z$, given that the system is in the state $x$. $s^{(n)}$ represents the position
of sample $n$ in the state space.



Propagation: From the known a priori pdf, samples are randomly chosen with regard
to their weight $\pi_n$. In doing so, a sample can be chosen several times. A motion model
is applied to the sample positions and diffusion is done by adding Gaussian noise to
each sample position. A sample that was chosen multiple times results in several spatially
close samples after the diffusion step. Finally the weight is calculated by measuring the
conditional observation density $p(z|x)$ and using it in eq. 1. The a posteriori pdf represented
by these samples acts as the a priori pdf in the next time step. This iterative evaluation
scheme is closely related to Bayes’ law



$$p(x|z) = \frac{p(z|x)\,p(x)}{p(z)} \qquad (2)$$



where $p(z)$ can be interpreted as a normalization constant, independent of the system
state $x$ [8]. The sample representation of the a posteriori pdf $p(x|z)$ is calculated by
implicitly using the a priori pdf $p(x)$ as the sample base from which new samples are
chosen and the probability of a measurement $p(z|x)$ given a certain state of the system
$x$ (eq. 1).



Initialization: In order to initialize without human interaction a fraction of the samples 
are chosen by using a random variable which is distributed according to an initialization 
distribution in every time step. In the first time step, all samples are chosen in this manner. 
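
The propagation and initialization steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function name, the parameter names, the identity motion model and the fallback to uniform weights when all observations are zero are choices made here for illustration.

```python
import random


def condensation_step(samples, weights, observe, init_sampler,
                      init_fraction=0.0, noise_sigma=1.0, rng=random):
    """One propagation step for samples given as tuples of floats.

    observe(s) returns the observation density p(z|s) for a sample s;
    init_sampler() draws one sample from the initialization distribution.
    """
    n = len(samples)
    if n == 0:
        return [], []
    n_init = int(round(init_fraction * n))
    # Factored sampling: draw with replacement, proportional to the weights.
    chosen = rng.choices(samples, weights=weights, k=n - n_init)
    # Diffusion: add Gaussian noise to every coordinate (identity motion model).
    new = [tuple(x + rng.gauss(0.0, noise_sigma) for x in s) for s in chosen]
    # Initialization: a fixed fraction is drawn from the initialization distribution.
    new += [init_sampler() for _ in range(n_init)]
    # Measurement and normalization of the weights as in eq. 1.
    obs = [observe(s) for s in new]
    total = sum(obs)
    new_weights = [o / total for o in obs] if total > 0 else [1.0 / n] * n
    return new, new_weights
```

Setting `init_fraction=1.0` reproduces the first time step, in which all samples are drawn from the initialization distribution.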

3.2 Bayesian Detection of Independent Motion 

First an overview of the proposed algorithm is given, then the algorithm is explained
in detail. Since optical flow (OF) is computationally expensive, the number of OF
measurements has to be restricted. However, when computing OF at sparse locations, one
would like to capture as much flow on independently moving objects as possible. An
adapted particle filter is chosen for this task. In this application the probability for a
position belonging to an independently moving object is chosen as the pdf for the
CONDENSATION algorithm, resulting in a state dimension of 2. A fraction of the samples
is chosen by propagating samples from the last time step using the CONDENSATION
approach. Here samples are chosen randomly with respect to their weight.
Samples with a high weight (a high probability for an independently moving object) are
chosen with a higher probability. In general these high-weight samples are chosen multiple
times, resulting in more samples in the vicinity of the old sample after the diffusion
in the next time step. The remaining part of the samples is generated by using a random
variable with a distribution depending on the image gradient. OF is measured at each
sample position.



Modifications of the standard CONDENSATION algorithm: A number of adapta- 
tions have been made to the CONDENSATION algorithm to ensure faster processing 
and optimization of the sample positions for the flow measurements: 

- Initialization Function: The measurement of OF is only possible on regions with 
spatial structure. The lower eigenvalue of the structure tensor or “cornerness” [3] is 
chosen as initialization distribution density function (IDF). By placing samples ran- 
domly with respect to this IDF, the majority of the samples are located on positions 
with high “cornerness” and hence giving optimal conditions for the calculation of 
the OF. Due to the random nature of the sample placing, some samples are however 
placed in regions with lower spatial structure, giving less optimal conditions for OF 
calculation, but on the other hand allowing the detection of independently moving 
objects in these regions. Obviously, there has to be a lower bound on the minimum 
spatial structure necessary for OF calculation. To ensure a fast detection of moving 
objects, the fraction of samples positioned in this way is chosen to be as high as 
0.7. This high initialization fraction obviously disturbs the posterior pdf, but on the 
other hand improves the response time of the detector. The high fraction of samples 
generated by using the IDF also reduces the degeneracy problem of particle filters. 

- Discretizing of the State Space: The sample positions are discretized, i.e. a sample 
cannot lie between pixels. This leads to the fact that multiple samples are located on 
the same location in state space, i.e. on the same pixel. Obviously only one expensive 
measurement of OF is necessary for all those samples located on the same pixel. 
This leads not to a reduction of the sample numbers, but only to a reduction of 
the necessary measurements and probability calculations (typically by 25%) and 
therefore speeds up the process. 

- Motion Model: In the special case of applying Bayesian sampling to locations of OF
measurements, no motion model of the underlying process is needed, because every
measurement (i.e. optical flow = apparent motion of a point between two consecutive
frames) represents the motion of the corresponding sample itself. The new sample
position can be predicted by using the old position and adding the OF measured at
this position.

- Non-Isotropic Diffusion: In typical traffic situations, large portions of the images are
only weakly structured (e.g. the asphalt of the road). A modified diffusion step
is therefore used to increase the number of sample positions on structured image regions: A
pointwise multiplication of the standard 2D Gaussian function with the cornerness
in a window around the actual position is used as the diffusion density function. The
window size is determined by the variances of the diffusion. Drawing randomly
once with respect to this density results in the new sample position.
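
Three of the modifications listed above (sampling from the IDF, one measurement per occupied pixel, and non-isotropic diffusion) can be sketched in Python as follows. The function names, the representation of the cornerness image as a nested list, and the window half-width of two standard deviations are assumptions made for this illustration. The computation of the cornerness itself is not shown.

```python
import math
import random


def sample_positions(density, k, rng=random):
    """Draw k pixel positions (row, col) with probability proportional to density.

    density must contain at least one positive entry.
    """
    cells = [(r, c) for r, row in enumerate(density) for c in range(len(row))]
    cell_weights = [density[r][c] for r, c in cells]
    return rng.choices(cells, weights=cell_weights, k=k)


def measure_once_per_pixel(samples, measure):
    """Evaluate measure only once per distinct pixel; return values and call count."""
    cache = {}
    for s in samples:
        if s not in cache:
            cache[s] = measure(s)
    return [cache[s] for s in samples], len(cache)


def diffuse(pos, cornerness, sigma, rng=random):
    """Draw a new position from a Gaussian window multiplied by the cornerness."""
    r0, c0 = pos
    half = int(math.ceil(2.0 * sigma))  # assumed window half-width
    cells, cell_weights = [], []
    for r in range(max(0, r0 - half), min(len(cornerness), r0 + half + 1)):
        for c in range(max(0, c0 - half), min(len(cornerness[0]), c0 + half + 1)):
            g = math.exp(-((r - r0) ** 2 + (c - c0) ** 2) / (2.0 * sigma ** 2))
            cells.append((r, c))
            cell_weights.append(g * cornerness[r][c])
    if sum(cell_weights) == 0.0:
        return pos  # no structure in the window: keep the old position
    return rng.choices(cells, weights=cell_weights, k=1)[0]
```

Because positions are integer pixel coordinates, several samples can share one pixel, and `measure_once_per_pixel` reports how many expensive measurements were actually needed.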




Fig. 2. Detection of a moving object by the angle
between the predicted flow direction (pointing
away from $\mathrm{FOE}_S$) and the measured flow
direction (pointing away from $\mathrm{FOE}_M$).

Fig. 3. The probability that a flow measurement
is located on an independently moving object,
$p_{c_i}(c_\alpha)$, in dependence of $c_\alpha = \cos(\alpha)$ at a
given inflection point $c_i = 0.7$.



Measurement. The measurement at each sample position should represent the probability
$p(x)$ that this sample is located on an independently moving object. Let $\alpha$ denote
the angle between the predicted translational optical flow pointing away from $\mathrm{FOE}_S$
and the rotation-corrected OF vector pointing away from $\mathrm{FOE}_M$ (see Fig. 2). For speed
reasons, $c_\alpha = \cos(\alpha)$ is used as a basis for the calculation of this probability [14].

The probability for an independently moving object $p_{c_i}(c_\alpha)$ in dependence of $c_\alpha$ is
modeled as a rounded step function:



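$$p_{c_i}(c_\alpha) = \begin{cases} e^{f(c_i)\cdot c_\alpha + \ln(0.5) - c_i\cdot f(c_i)} & \text{if } c_\alpha > c_i, \\ 1.0 - e^{-f(c_i)\cdot c_\alpha + \ln(0.5) + c_i\cdot f(c_i)} & \text{if } c_\alpha \le c_i, \end{cases} \qquad (3)$$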



where $f(c_i) = \frac{\ln(0.01) - \ln(0.5)}{1 - |c_i|}$ is a function of the inflection point $c_i$. Since it is not
feasible to set probabilities to 1.0 or 0.0, $p_{c_i}(c_\alpha)$ is scaled and shifted to represent a
minimum uncertainty. Fig. 3 shows $p_{c_i}(c_\alpha)$. In the proposed algorithm, the inflection
point is chosen automatically to be $c_i = \tilde{c}_\alpha - \sigma_{c_\alpha}$, where $\tilde{c}_\alpha$ is the median of all
cosine angles not detected as “moving” in the last time step, and $\sigma_{c_\alpha}$ is the variance of the
$c_\alpha$. Choosing $c_i$ automatically has the advantage that erroneous camera positions do not
disturb the measurement. This only holds under the assumption that more than half of the
flow vectors are located on the static scene. Similar terms ensuring a minimum cornerness
$p_c$ (since OF can only be computed with spatial structure), a minimum flow length $p_f$
(standing for the accuracy of the OF computation) and a minimum distance from the focus
of expansion $p_{foe}$ (since errors in the FOE position influence the direction prediction
for closer points more than for further points) are introduced. The overall probability
$p(x) = p(z|x)$ is then given by:

$$p(x) = p_{c_i}(c_\alpha) \cdot p_c \cdot p_f \cdot p_{foe} \qquad (4)$$
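
The rounded step of eq. 3 and the automatic choice of the inflection point can be sketched in Python as follows. The steepness is passed in as a plain negative number `f` rather than computed from the inflection point, which is an assumption made here to keep the sketch independent of the exact form of that function. The final scaling and shifting to a minimum uncertainty is omitted, and the function names are hypothetical.

```python
import math
import statistics


def p_moving(c_alpha, c_i, f):
    """Rounded step of eq. 3; f is assumed negative so that p falls as c_alpha rises."""
    if c_alpha > c_i:
        return math.exp(f * c_alpha + math.log(0.5) - c_i * f)
    return 1.0 - math.exp(-f * c_alpha + math.log(0.5) + c_i * f)


def inflection_point(static_cosines):
    """Median minus variance of the cosines not detected as moving in the last step."""
    # The population variance is used here; the text does not say which estimator is meant.
    return statistics.median(static_cosines) - statistics.pvariance(static_cosines)
```

Both branches meet at the value 0.5 for $c_\alpha = c_i$, so the function is continuous at the inflection point.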

Spatio-Temporal Filtering. In order to detect whether independently moving objects are
present, the sampled observation density is investigated. An outlier observation density
image is constructed by superimposing Gaussian hills with a given sigma for all sample
positions. In order to further increase the robustness, a temporal digital low-pass filter
is applied to the outlier observation density image sequence. A user-selected threshold on
the output of this filter is used to mark independently moving objects.
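
A minimal Python sketch of this filtering stage is given below. It rests on several assumptions that are not stated in the text: each Gaussian hill is weighted by the probability of its sample, the temporal filter is a first-order recursive low pass with coefficient `a`, and images are nested lists. The function names are hypothetical.

```python
import math


def observation_density(samples, weights, shape, sigma):
    """Superimpose one weighted Gaussian hill per sample position (row, col)."""
    rows, cols = shape
    img = [[0.0] * cols for _ in range(rows)]
    for (r0, c0), w in zip(samples, weights):
        for r in range(rows):
            for c in range(cols):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                img[r][c] += w * math.exp(-d2 / (2.0 * sigma ** 2))
    return img


def low_pass(previous, current, a):
    """First-order recursive low pass: y_t = (1 - a) * y_{t-1} + a * x_t."""
    return [[(1.0 - a) * p + a * x for p, x in zip(prow, crow)]
            for prow, crow in zip(previous, current)]


def detect(filtered, threshold):
    """Return all pixel positions whose filter output exceeds the threshold."""
    return [(r, c) for r, row in enumerate(filtered)
            for c, v in enumerate(row) if v > threshold]
```

With such a filter, a single isolated high-probability sample in one frame raises the output only partially, while a cluster of samples that persists over several frames drives it above the threshold.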



4 Experiments 

Experiments were carried out using synthetic images and sensor information as well as 
images and sensor data gathered with the Urban Traffic Assistant (UTA) demonstrator 
from the DaimlerChrysler AG [4].




Fig. 4. Some images from the synthetic intersection sequence. The camera is moving on a straight
line, while the car in the image is on a collision course. Points where the filter output is above a
threshold of 0.35 are marked white.



Simulated Data. To test the algorithm, a simulated intersection was realized in VRML.
Simple block models of houses, textured with real image data, are located on the corners
of the intersecting street (Fig. 4). A model of a car was used as an independently moving
object. Screenshots of a ride through this intersection provided the image data, while
the sensor information was calculated from the known camera parameters at the time of
the screenshots. Fig. 4 shows some images from the simulated image sequence. Points
where the spatio-temporal filter output is above 0.35 are marked with white blobs. Only
very few points are detected because the synthetic car color is uniform due to the simple
texture model.



Real Data. The setup of UTA [4] includes a digital camera mounted on a pan-tilt-unit
(PTU), GPS, map data, internal velocity and yawrate sensors, etc. The fusion of GPS and
map data will be used to announce the geometry of an approaching intersection to the
vision system. The camera then focuses on the intersection. Using the known egomotion
of the camera, independently moving objects are detected and the driver’s attention can
be directed towards them. Fig. 5 shows the results on a real world image sequence.

Fig. 5. Some images from a real intersection sequence. Points where the spatio-temporal filter
output is above 0.35 are marked white.



Timing. The computation frequency is 18.2 ± 2.0 frames per second (fps) for the 
synthetic sequence and 20.0 ± 2.4 fps for the real world sequence. These timings were 
measured on a standard 2.4 GHz Pentium IV PC with an overall number of 1000 samples. 
The optical flow used a pyramid of size 3. 
Fig. 6. False positive and detection rates for a synthetic (left) and real world (right) sequence. The
ground truth image segmentation needed for obtaining these rates was known in the synthetic case 
and was generated by hand in the real world case. The moving object was approximated in the 
real world case by several rectangles. In the synthetic sequence (A) moving objects were visible 
between frame 72 and frame 117. In the real world image sequence (B), an image sequence of 80 
images was evaluated. See text for further details. 



Detection Rates. The detection rates and false positive rates were calculated on a pixel
basis using a known image segmentation: For every pixel where the optical flow has been
calculated, it is determined whether it is a false positive or a true detection, resulting in a
detection rate of 100 % when every flow measurement on the moving object is detected
as such by the algorithm. In the case of synthetic image data, the segmentation could
be derived from the known 3D scene structure; in the case of the real world sequence,
the image was segmented by hand, with several rectangles approximating the moving
object. Fig. 6 shows that a high detection rate combined with a low (nearly zero) false
positive rate could be obtained with the chosen approach. The remaining false positive
rate results from the spatio-temporal filtering of the results. All false positives are located
spatially very close to the moving object. Since the camera and the moving object were
on collision course, independent motion was detected mainly at the object boundaries
(car front). In the parts of the sequence where no moving object was visible, the false
positive rate stayed very low, causing no false object alarms. The evaluation of the real
sequence showed essentially the same behavior and demonstrates the robustness against noise.

5 Conclusions and Further Work 

A fast and robust Bayesian system for the detection of independently moving
objects by a moving observer has been presented. Two advantages motivate the
chosen approach and lead to a very fast and robust algorithm:

1. By choosing the IDF to depend on the image gradient, most samples are positioned
in high contrast regions, resulting in optimal conditions for the calculation of optical
flow. However, because the IDF is only the distribution function for the randomly
chosen sample positions, the positions are not restricted to high contrast regions;
some of them are also placed in lower contrast regions. This allows the detection
of independently moving objects in these lower contrast regions as well, while at the
same time a maximum sample number and therefore a maximum computation time
is guaranteed.

2. The use of the particle filter approach leads to a clustering of flow measurements in 
regions where independent motion was detected in the last time step. The surround- 
ing flow measurements can be used to either confirm or reject the existence of an 
independently moving object by using a spatio-temporal filter. 

Experiments with synthetic and real image data were carried out. Further work should
include:

- investigation of the trajectory extraction possibility of moving objects 

- fast robust egomotion estimation refinement by fusing sensor information (speed, 
yawrate and steering angle) with image based measurements (optical flow from 
static scene) 
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Abstract. We address the problem of image segmentation with statis- 
tical shape priors in the context of the level set framework. Our paper 
makes two contributions: Firstly, we propose a novel multi-modal statistical
shape prior which allows the encoding of multiple fairly distinct training
shapes. This prior is based on an extension of classical kernel density 
estimators to the level set domain. Secondly, we propose an intrinsic reg- 
istration of the evolving level set function which induces an invariance of 
the proposed shape energy with respect to translation. We demonstrate 
the advantages of this multi-modal shape prior applied to the segmenta- 
tion and tracking of a partially occluded walking person. 



1 Introduction 

When interpreting a visual scene, human observers generally revert to higher- 
level knowledge about expected objects in order to disambiguate the low-level 
intensity or color information of the given input image. Much research effort has 
been devoted to imitating such an integration of prior knowledge into machine- 
vision problems, in particular in the context of image segmentation. 

Among variational approaches, the level set method [16,10] has become a 
popular framework for image segmentation. The level set framework has been 
applied to segment images based on numerous low-level criteria such as edge 
consistency [13,2,11], intensity homogeneity [3,22], texture information [17,1] 
and motion information [6]. 

More recently, it was proposed to integrate prior knowledge about the shape
of expected objects into the level set framework [12,21,5,20,8,9,4]. Building on
these developments, we propose in this paper two contributions. Firstly, we introduce
a statistical shape prior which is based on the classical kernel density estimator
[19,18] extended to the level set domain. In contrast to existing approaches
to shape priors in level set segmentation, this prior makes it possible to approximate
arbitrary distributions of shapes well. Secondly, we propose a translation-invariant
shape energy by an intrinsic registration of the evolving level set function. Such
a closed-form solution removes the need to locally update explicit pose parameters.
Moreover, we will argue that this approach is more accurate because the
resulting shape gradient contains an additional term which accounts for the effect
of boundary variation on the location of the evolving shape. Numerical results
demonstrate our method applied to the segmentation of a partially occluded
walking person.



2 Level Set Segmentation 

Originally introduced in the community of computational physics as a means of
propagating interfaces [16]¹, the level set method has become a popular framework
for image segmentation [13,2,11]. The central idea is to implicitly represent
a contour $C$ in the image plane $\Omega \subset \mathbb{R}^2$ as the zero-level of an embedding
function $\phi : \Omega \to \mathbb{R}$:

$$C = \{x \in \Omega \mid \phi(x) = 0\} \qquad (1)$$

Rather than directly evolving the contour $C$, one evolves the level set function $\phi$.
The two main advantages are, firstly, that one does not need to deal with control
or marker points (and respective regridding schemes to prevent overlapping),
and secondly, that the embedded contour is free to undergo topological changes such
as splitting and merging, which makes it well-suited for the segmentation of
multiple or multiply-connected objects.

In the present paper, we use a level set formulation of the piecewise constant
Mumford-Shah functional, cf. [15,22,3]. In particular, a two-phase segmentation
of an image $I : \Omega \to \mathbb{R}$ can be generated by minimizing the functional [3]:

$$E_{cv}(\phi) = \int_\Omega (I - u_+)^2\, H\phi(x)\,dx + \int_\Omega (I - u_-)^2\, (1 - H\phi(x))\,dx + \nu \int_\Omega |\nabla H\phi|\,dx, \qquad (2)$$



with respect to the embedding function $\phi$. Here $H\phi \equiv H(\phi)$ denotes the Heaviside
step function and $u_+$ and $u_-$ represent the mean intensity in the two regions
where $\phi$ is positive or negative, respectively. While the first two terms in (2) aim
at minimizing the gray value variance in the separated phases, the last term
enforces a minimal length of the separating boundary. Gradient descent with
respect to $\phi$ amounts to the evolution equation:

$$\frac{\partial\phi}{\partial t} = -\frac{\partial E_{cv}}{\partial\phi} = \delta_\epsilon(\phi)\left[\nu\,\mathrm{div}\!\left(\frac{\nabla\phi}{|\nabla\phi|}\right) - (I - u_+)^2 + (I - u_-)^2\right]. \qquad (3)$$

Chan and Vese [3] propose a smooth approximation $\delta_\epsilon$ of the delta function
which allows the detection of interior boundaries.
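
To make the evolution (3) concrete, the following Python sketch performs one explicit gradient descent step on a small image stored as a nested list. It is a minimal sketch under assumptions: the arctan-based form of $\delta_\epsilon$, the central differences with replicated borders, the time step `dt` and the function names are choices made here for illustration and are not taken from the text.

```python
import math


def delta_eps(phi, eps=1.0):
    """Smooth delta function, here the derivative of an arctan-shaped Heaviside."""
    return eps / (math.pi * (eps * eps + phi * phi))


def chan_vese_step(image, phi, nu=0.1, dt=0.5, eps=1.0):
    """One explicit Euler step of the evolution (3); returns the updated phi."""
    rows, cols = len(image), len(image[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    inside = [image[r][c] for r, c in cells if phi[r][c] > 0]
    outside = [image[r][c] for r, c in cells if phi[r][c] <= 0]
    u_plus = sum(inside) / len(inside) if inside else 0.0
    u_minus = sum(outside) / len(outside) if outside else 0.0

    def p(r, c):
        # phi with replicated border values
        return phi[min(max(r, 0), rows - 1)][min(max(c, 0), cols - 1)]

    new_phi = [row[:] for row in phi]
    for r, c in cells:
        px = (p(r, c + 1) - p(r, c - 1)) / 2.0
        py = (p(r + 1, c) - p(r - 1, c)) / 2.0
        pxx = p(r, c + 1) - 2.0 * p(r, c) + p(r, c - 1)
        pyy = p(r + 1, c) - 2.0 * p(r, c) + p(r - 1, c)
        pxy = (p(r + 1, c + 1) - p(r + 1, c - 1)
               - p(r - 1, c + 1) + p(r - 1, c - 1)) / 4.0
        grad2 = px * px + py * py
        # div(grad phi / |grad phi|), regularized against division by zero
        curvature = (pxx * py * py - 2.0 * px * py * pxy + pyy * px * px) / (grad2 ** 1.5 + 1e-8)
        force = nu * curvature - (image[r][c] - u_plus) ** 2 + (image[r][c] - u_minus) ** 2
        new_phi[r][c] = phi[r][c] + dt * delta_eps(phi[r][c], eps) * force
    return new_phi
```

Iterating this step moves pixels whose intensity is closer to $u_+$ towards positive values of $\phi$ and all others towards negative values, while the curvature term smooths the zero level.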

In the corresponding Bayesian interpretation, the length constraint given by
the last term in (2) corresponds to a prior probability which induces the segmentation
scheme to favor contours of minimal length. But what if we have more
informative prior knowledge about the shape of expected objects? Building
on recent advances [12,21,5,20,8,9,4] and on classical methods of non-parametric
density estimation [19,18], we will in the following construct a shape prior which
statistically approximates an arbitrary distribution of training shapes (without
making the restrictive assumption of a Gaussian distribution).

¹ See [10] for a precursor containing some of the key ideas of level sets.

Fig. 1. Sample training shapes (binarized and centered).

3 Kernel Density Estimation in the Level Set Domain 

Given two shapes encoded by level set functions $\phi_1$ and $\phi_2$, one can define their
distance by the set symmetric difference (cf. [4]):

$$d^2(H\phi_1, H\phi_2) = \int_\Omega \left(H\phi_1(x) - H\phi_2(x)\right)^2 dx. \qquad (4)$$

In contrast to the shape dissimilarity measures discussed in [20,8], the above
measure corresponds to an $L^2$-distance; in particular, it is non-negative, symmetric
and fulfills the triangle inequality. Moreover it does not depend on the
size of the image domain (as long as both shapes are entirely inside the image).

Given a set of training shapes $\{\phi_i\}_{i=1,\dots,N}$ - see for example Figure 1 - one can
estimate a statistical distribution by reverting to the classical Parzen-Rosenblatt
density estimator [19,18]:

$$\mathcal{P}(\phi) \propto \frac{1}{N} \sum_{i=1}^{N} \exp\left(-\frac{1}{2\sigma^2}\, d^2(H\phi, H\phi_i)\right). \qquad (5)$$

This is probably the theoretically most studied density estimation method. It
was shown to converge to the true distribution in the limit of infinite training
samples (under fairly mild assumptions). There exist extensive studies as to how
to optimally choose the kernel width $\sigma$. For this work, we simply fix $\sigma^2$ to be the
mean squared nearest-neighbor distance:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \min_{j \neq i} d^2(H\phi_i, H\phi_j). \qquad (6)$$



The intuition behind this choice is that the width of the Gaussians is chosen such 
that on the average the next training shape is within one standard deviation. 
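
The estimator (5) with the kernel width (6) can be sketched in Python for shapes that are already given as binary masks, i.e. as sampled versions of $H\phi$. The representation as nested lists of 0 and 1, the unit pixel area and the function names are assumptions made for this illustration. The value returned by `shape_prior` is the right-hand side of (5), which is only proportional to the density.

```python
import math


def dist2(a, b):
    """Squared distance (4) between two binary masks of equal size."""
    return sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def kernel_width2(train):
    """sigma^2 as in (6): mean squared distance to the nearest other training shape."""
    n = len(train)
    return sum(min(dist2(train[i], train[j]) for j in range(n) if j != i)
               for i in range(n)) / n


def shape_prior(mask, train, sigma2):
    """Unnormalized kernel density estimate (5) for the shape given by mask."""
    return sum(math.exp(-dist2(mask, t) / (2.0 * sigma2)) for t in train) / len(train)
```

For three one-row training masks `[[1, 0, 0, 0]]`, `[[1, 1, 0, 0]]` and `[[1, 1, 1, 1]]`, the nearest-neighbor distances are 1, 1 and 2, so that `kernel_width2` returns 4/3.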
In contrast to existing shape priors which are commonly based on the assumption
of a Gaussian distribution (cf. [12]), the distribution in (5) is a multi-modal
one (thereby allowing more complex training shapes). We refer to [7] for
an alternative multi-modal prior for spline-based shape representations.



4 Translation Invariance by Intrinsic Alignment 



By construction the shape prior (5) is not invariant with respect to certain
transformations of the shape $\phi$ such as translation, rotation and scaling. In
the following, we will demonstrate how such an invariance can be integrated
analytically by an intrinsic registration process. We will detail this for the case
of translation, but extensions to rotation and scaling are straightforward.

Assume that all training shapes $\{\phi_i\}$ are aligned with respect to their center
of gravity. Then we define the distance between a shape $\phi$ and a given training
shape as:

$$d^2(H\phi, H\phi_i) = \int_\Omega \left(H\phi(x - x_\phi) - H\phi_i(x)\right)^2 dx, \qquad (7)$$

where the function $\phi$ is evaluated in coordinates relative to its center of gravity $x_\phi$
given by:

$$x_\phi = \frac{\int x\, H\phi\, dx}{\int H\phi\, dx}. \qquad (8)$$



This intrinsic alignment guarantees that in contrast to (4), the distance (7) is
invariant to the location of the shape $\phi$. The corresponding shape prior (5)
is by construction invariant to translation of the shape $\phi$. Analogous intrinsic
alignments with respect to scale and rotation are conceivable but will not be
considered here.
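
The effect of the intrinsic alignment can be illustrated with a small Python sketch on binary masks. As a simplification that is not part of the formulation above, the center of gravity is rounded to the nearest pixel so that the comparison can be done on integer coordinates; the function names are hypothetical.

```python
import math


def centroid(mask):
    """Center of gravity (8) of a non-empty binary mask, as (row, col)."""
    area = sum(sum(row) for row in mask)
    cr = sum(r * v for r, row in enumerate(mask) for v in row) / area
    cc = sum(c * v for row in mask for c, v in enumerate(row)) / area
    return cr, cc


def centered_pixels(mask):
    """Foreground pixels in coordinates relative to the rounded center of gravity."""
    cr, cc = centroid(mask)
    r0, c0 = math.floor(cr + 0.5), math.floor(cc + 0.5)
    return {(r - r0, c - c0) for r, row in enumerate(mask)
            for c, v in enumerate(row) if v}


def aligned_dist2(a, b):
    """Translation-invariant squared distance between two binary masks."""
    # For binary masks the integral of the squared difference equals the
    # number of pixels in the symmetric difference of the two shapes.
    return len(centered_pixels(a) ^ centered_pixels(b))
```

Two masks that show the same shape at different image locations then have distance zero, whereas the distance (4) between them is in general positive.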

Invariance to certain group transformations by intrinsic alignment of the
evolving shape as proposed in this work is different from numerically optimizing
a set of explicit pose parameters [5,20,8]. The shape energy is by construction
invariant to translation. This removes the necessity to intermittently iterate
gradient descent equations for the pose. Moreover, as we will see in Section 6,
this approach is conceptually more accurate in that it induces an additional
term in the shape gradient which accounts for the effect of shape variation on
the center of gravity $x_\phi$. Current effort is focused on extending this approach to
a larger class of invariances. For explicit contour representations, an analogous
intrinsic alignment with respect to similarity transformation was proposed in [7].



5 Knowledge-Driven Segmentation 

In the Bayesian framework, the level set segmentation can be seen as maximizing
the conditional probability

$$\mathcal{P}(\phi \mid I) = \frac{\mathcal{P}(I \mid \phi)\, \mathcal{P}(\phi)}{\mathcal{P}(I)}, \qquad (9)$$

with respect to the level set function $\phi$, where $\mathcal{P}(I)$ is a constant. This is equivalent
to minimizing the negative log-likelihood which is given by a sum of two
energies:

$$E(\phi) = \frac{1}{\alpha} E_{cv}(\phi) + E_{shape}(\phi), \qquad (10)$$

with a positive weighting factor $\alpha$ and the shape energy

$$E_{shape}(\phi) = -\log \mathcal{P}(\phi), \qquad (11)$$

where $\mathcal{P}(\phi)$ is given in (5).

Minimizing the energy (10) generates a segmentation process which simulta- 
neously aims at maximizing intensity homogeneity in the separated phases and 
a similarity of the evolving shape with respect to the training shapes encoded 
through the statistical estimator. 

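Gradient descent with respect to the embedding function amounts to the
evolution:

$$\frac{\partial\phi}{\partial t} = -\frac{1}{\alpha}\frac{\partial E_{cv}}{\partial\phi} - \frac{\partial E_{shape}}{\partial\phi}, \qquad (12)$$

with the image-driven component of the flow given in (3) and the knowledge-driven
component given by:

$$\frac{\partial E_{shape}}{\partial\phi} = \frac{\sum_i \alpha_i\, \frac{\partial}{\partial\phi} d^2(H\phi, H\phi_i)}{2\sigma^2 \sum_i \alpha_i}, \qquad (13)$$

which simply induces a force in direction of each training shape $\phi_i$ weighted by
the factor:

$$\alpha_i = \exp\left(-\frac{1}{2\sigma^2}\, d^2(H\phi, H\phi_i)\right), \qquad (14)$$

which decays exponentially with the distance from shape $\phi_i$.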
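
The role of the exponential weighting factor can be illustrated with a short Python sketch. It assumes the form of the knowledge-driven flow that follows from differentiating the negative logarithm of (5): a weighted average of the per-shape distance gradients, divided by $2\sigma^2$. The function names and the representation of each gradient as a flat list of numbers are choices made here for illustration.

```python
import math


def shape_weights(dists2, sigma2):
    """Exponential weights of the training shapes, normalized to sum to one."""
    # Subtracting the minimum avoids underflow and cancels in the normalization.
    m = min(dists2)
    raw = [math.exp(-(d - m) / (2.0 * sigma2)) for d in dists2]
    total = sum(raw)
    return [x / total for x in raw]


def shape_gradient(grads, dists2, sigma2):
    """Weighted average of the per-shape distance gradients, divided by 2 sigma^2."""
    w = shape_weights(dists2, sigma2)
    return [sum(wi * g[k] for wi, g in zip(w, grads)) / (2.0 * sigma2)
            for k in range(len(grads[0]))]
```

Training shapes that are far from the current shape receive a weight close to zero, so the evolving shape is drawn mainly towards the nearest training shapes. This is what makes the prior multi-modal.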



6 Euler-Lagrange Equations for Nested Functions

The remaining shape gradient in equation (13) is particularly interesting since
the translation-invariant distance in (7) exhibits a two-fold (nested) dependence
on $\phi$. The computation of the corresponding Euler-Lagrange equations is fairly
involved. Due to space limitations, we will only state the final result:

$$\frac{\partial}{\partial\phi} d^2(H\phi, H\phi_i) = 2\,\delta(\phi(x)) \left[ \left(H\phi(x) - H\phi_i(x + x_\phi)\right) - \frac{(x - x_\phi)^T}{\int H\phi\, dx} \int \left(H\phi(x') - H\phi_i(x' + x_\phi)\right) \nabla H\phi(x')\, dx' \right]. \qquad (15)$$
Fig. 2. Various frames showing the segmentation of a partially occluded walking person
generated with the Chan-Vese model (2). Based on a pure intensity criterion, the
walking person cannot be separated from the occlusion and darker areas of the background
such as the person’s shadow.

Note that as for the image-driven component of the flow in (3), the entire expression
is weighted by the $\delta$-function which stems from the fact that the function
$d$ only depends on $H\phi$. While the first term in (15) draws $H\phi$ to the template
$H\phi_i$ in the local coordinate frame, the second term compensates for shape deformations
which merely lead to a translation of the center of gravity $x_\phi$. Not
surprisingly, this second term contains an integral over the entire image domain
because the change of the center of gravity through local deformation of $\phi$ depends
on the entire function $\phi$. In numerical experiments we found that this
additional term increases the speed of convergence by a factor of 3 (in terms of
the number of iterations necessary).

7 Tracking a Walking Person 

In the following we apply the proposed shape prior to the segmentation of a par- 
tially occluded walking person. To this end, a sequence of a dark figure walking 
in a (fairly bright) squash court was recorded.² We subsequently introduced a
partial occlusion into the sequence and ran an intensity segmentation by iter- 
ating the evolution (3) 100 times for each frame (using the previous result as 
initialization). For a similar application of the Chan-Vese functional (without 
statistical shape priors), we refer to [14]. The set of sample frames in Figure 2 
clearly demonstrates that this purely image-driven segmentation scheme is not 
capable of separating the object of interest from the occluding bar and similarly 
shaded background regions such as the object’s shadow on the floor. 

In a second experiment, we manually binarized the images corresponding to 
the first half of the original sequence (frames 1 through 42) and aligned them to 
their respective center of gravity to obtain a set of training shapes - see Figure 1.
Then we ran the segmentation process (12) with the shape prior (5). Apart from 
adding the shape prior we kept the other parameters constant for comparability. 

² We thank Alessandro Bissacco and Payam Saisan for providing the image data.

Fig. 3. Segmentation generated by minimizing energy (10) combining intensity information
with the statistical shape prior (5). Comparison with the respective frames
in Figure 2 shows that the multi-modal shape prior makes it possible to separate the walking
person from the occlusion and darker areas of the background such as the shadow. The
shapes in the bottom row were not part of the training set.

Figure 3 shows several frames from this knowledge-driven segmentation.
A comparison to the corresponding frames in Figure 2 demonstrates several
properties of our contribution:

— The shape prior makes it possible to accurately reconstruct an entire set of 
fairly different shapes. Since the shape prior is defined on the level set function 
$\phi$ rather than on the boundary C (cf. [5]), it can easily reproduce the 
topological changes present in the training set. 

— The shape prior is invariant to translation such that the object silhouette 
can be reconstructed in arbitrary locations of the image. All training shapes 
are centered at the origin, and the shape energy depends merely on an in- 
trinsically aligned version of the evolving level set function. 

— The statistical nature of the prior also makes it possible to reconstruct 
silhouettes which were not part of the training set (beyond frame 42). 



8 Conclusion 

We combined concepts of non-parametric density estimation with level set based 
shape representations in order to create a statistical shape prior for level set 
segmentation which can accurately represent arbitrary shape distributions. In 
contrast to existing approaches, we do not rely on the restrictive assumptions of 
a Gaussian distribution and can therefore encode fairly distinct shapes. 

Moreover, we proposed an analytic solution to generate invariance of the 
shape prior to translation of the object of interest. By computing the shape 
prior in coordinates relative to the object’s center of gravity, we remove the need 
to numerically update a pose estimate. Moreover, we argue that this intrinsic 
registration induces a more accurate shape gradient which comprises the effect 
of shape or boundary deformation on the pose of the evolving shape. 

Finally, we demonstrated the effect of the proposed shape prior on the 
segmentation and tracking of a partially occluded human figure. In particular, these 
results demonstrate that the proposed shape prior makes it possible to accurately 
reconstruct occluded silhouettes according to the prior in arbitrary locations (even 
silhouettes which were not in the training set). 
Abstract. A head pose estimation system is described, which uses low 
resolution video sequences to determine the orientation and position of 
a head with respect to an internally calibrated camera. The system employs 
a feature based approach to roughly estimate the head pose and an 
approach using a symmetry based illumination model to refine the head 
pose independent of the user's albedo and illumination influences. 

1 Introduction 

3D head pose estimation and tracking from monocular video sequences is a very 
active field of research in computer vision. In this paper we want to introduce 
a 3D head pose estimation system which is designed to initialize a tracking 
framework to track arbitrary movements of a head. This paper concentrates on 
the initialization part of our system which has to work on low resolution video 
sequences. The head usually covers 60x40 pixels in the images, and the system has 
to be robust with respect to changes in illumination, facial gestures and different users. 
A number of different approaches have been proposed for the problem of 3D 
head pose estimation and tracking. Some use a 3D head model and track 
distinct image features through the sequence [2], [7], [4], [1]. The image features 
correspond to anchor points on the 3D head model, which can then be aligned 
accordingly to estimate the head pose. Another approach is to model 3D 
head movement as a linear combination of a set of bases that are generated by 
changing the pose of the head and computing a difference image of the poses 
[14]. The coefficients of the linear combination that models the difference image 
best are used to determine the current head pose. A third popular approach is to 
employ optical flow constrained by the geometric structure of the head model [6], 
[5], [16]. Since optical flow is very sensitive to illumination changes, 
[10] also included an illumination basis to model illumination influences in the 
optical flow approach. Except for [6], all approaches work on high resolution 
images. Except for [2], none of the mentioned approaches includes an automatic 
initialization of the tracking that does not require the user to keep still and to 
look straight into the camera. 

There are a number of 3D face pose estimation approaches. Systems which do 
not require high resolution images of the face either lack the accuracy 
needed to initialize a tracking system [8], or are not illumination and 
person invariant [13]. 
Fig. 1. Structure of the head pose estimation system. 



2 Motivation 

A fully automatic head pose tracking system has to automatically initialize the 
tracking. This includes a reliable estimation of the 3D head pose without any 
customized model information. Some systems require the user to look straight into 
the camera to initialize [7], [14], [16], others only determine the relative change 
of pose with respect to the pose in the first frame [4], or require the user 
to identify certain features like eyes and nose manually [5]. Since we intend to 
automatically initialize and track a head in low resolution video sequences, none of 
the above approaches is an option. Our goal is to reliably estimate the head 
pose with respect to the camera if the deviation of the current orientation from 
a frontal view does not exceed 30 degrees. If the current head pose does not meet 
this condition, no initialization should occur until we reach a frame 
with a head pose that does. One can think of it as a trap. 
The initialization process is divided into five parts, see Fig. 1. 

3 Implemented System 

3.1 2D Face Detection 

First we employ a face detection algorithm that is capable of detecting faces 
that meet our orientation condition. We use the OpenCV implementation 
[http://sourceforge.net/projects/opencvlibrary] of the detection algorithm proposed 
by Viola and Jones [11], which is fast and reliable. The face detection 
gives us a region of interest (ROI) in the image which is passed on to the next 
step of the initialization. 

3.2 Facial Feature Detection and Rough Pose Estimation 

In order to roughly estimate the current head pose we intend to detect the image 
positions of the eyes and the tip of the nose. A radial symmetry interest operator 
is employed [3] on the upper part of the ROI to detect possible eyes. Since in 
the low resolution images, each eye is usually a dark radial feature surrounded 
by a brighter area, eyes yield a large response in the radial symmetry analysis. 
It is still difficult, though, to make an exact prediction for the eyes. Instead of 
taking the two largest local maxima from the radial symmetry analysis, we 
allow 15 hypotheses of possible eye positions at this stage, to be sure to 
include the correct ones. The same strategy is used for the tip of the nose. Due 
to its prominent position, the tip of the nose reflects light very well and usually 
appears as a bright radial feature on a darker background. 3 hypotheses usually 
suffice for the nose to include the correct position. 

For every combination of 2 eyes and a nose we can compute a resulting head 
pose using a weak geometry projection. We have the 3D object coordinates of 
the eyes and the nose on a 3D head model of an average person, and the 3 
corresponding image positions of the combination. Having computed the pose 
of every combination, we can discard all the combinations which deviate more 
than 30 degrees from a frontal view. These heuristics usually reduce the 
number of relevant combinations significantly. The remaining combinations are 
evaluated. For this evaluation we use a database of 3000 different eyes and noses 
in Gabor wavelet space [15]. Each eye hypothesis and nose hypothesis is compared 
against the database and receives a final feature score. This feature score is the 
similarity value of the database entry that fits best. The sum of the feature 
scores of the combination yields the combination score. The associated pose of 
the combination that received the highest combination score is an estimate of 
the current head pose. 
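
The selection procedure described above can be sketched as follows. This is only an illustrative sketch: the callbacks `pose_of` and `score_of` are hypothetical stand-ins for the pose computation and for the comparison against the Gabor wavelet database, and the interpretation of "deviates more than 30 degrees" as a per-angle limit is an assumption. With 15 eye hypotheses and 3 nose hypotheses there are (15 · 14 / 2) · 3 = 315 combinations to consider.

```python
from itertools import combinations


def best_combination(eye_hyps, nose_hyps, pose_of, score_of, max_angle_deg=30.0):
    """Return (score, pose, combination) with the highest summed feature score.

    pose_of(eye_a, eye_b, nose): hypothetical callback returning three
        orientation angles in degrees for one combination.
    score_of(hypothesis): hypothetical callback returning a feature score.
    Returns None if every combination is rejected.
    """
    best = None
    for eye_a, eye_b in combinations(eye_hyps, 2):
        for nose in nose_hyps:
            pose = pose_of(eye_a, eye_b, nose)
            # Assumption: reject if any angle is too far from a frontal view.
            if max(abs(angle) for angle in pose) > max_angle_deg:
                continue
            score = score_of(eye_a) + score_of(eye_b) + score_of(nose)
            if best is None or score > best[0]:
                best = (score, pose, (eye_a, eye_b, nose))
    return best
```

Rejecting combinations by pose before scoring them keeps the number of database comparisons small, which matches the order of the steps in the text.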

3.3 Symmetry Considerations 

For the refinement of the initialization a novel, illumination and albedo 
insensitive, symmetry based approach is employed. First we assume that the texture of 
the right half of the face is symmetric to the left. By projecting a 3D model of 
a face under the estimated pose into the image, we can extract the underlying 
texture of the face from the image. Now consider a point p on a Lambertian 
surface. Ignoring attached shadows, the irradiance value $E_p$ of the surface point p 
is given by 



$$E_p = k_p \,(N_p \cdot L) \qquad (1)$$

where $k_p$ is the nonnegative absorption coefficient (albedo) of the surface at point 
p, $N_p$ is the surface normal at point p, and $L \in \mathbb{R}^3$ characterizes the 
collimated light source, where $\|L\|$ gives the intensity of the light source. The gray 
level intensity $I_p$ measured by a camera is an observation of $E_p$. We can therefore 
write 



$$I_p \approx k_p \,(N_p \cdot L) \qquad (2)$$



We now assume that a face is symmetric with respect to a mirror axis along 
the nose. Therefore we can assume 



$$k_{pr} \approx k_{pl} \;\Rightarrow\; \frac{I_{pr}}{N_{pr} \cdot L} \approx \frac{I_{pl}}{N_{pl} \cdot L} \;\Rightarrow\; \frac{I_{pr}}{I_{pl}} \approx \frac{N_{pr} \cdot L}{N_{pl} \cdot L} \qquad (3)$$

where $k_{pr}$ is the albedo of a point pr on the right side of the face and $k_{pl}$ is the 
albedo of the symmetrically corresponding point pl on the left side of the face. 
Fig. 2 illustrates the computation. 




Fig. 2. From left to right: Extracted texture of the head under the current pose. Face 
texture divided into right and left half. Division of flipped right half of the face texture 
and the left half of the face texture. Quotient. Right: Illumination basis with 10 basis 
vectors that approximately span the space of albedo independent face texture quotients 



3.4 Illumination Model 

Following the symmetry considerations, we can now generate a parametric 
illumination model for human faces. The following is related to [14] and [10]. In 
contrast to [14] and [10] we do not generate the parametric illumination model 
based on the textures themselves. In order to achieve user independence we use 
the fraction $H(I) = I_r / I_l$ of the flipped right half $I_r$ and the left half $I_l$ 
of the face texture. We extract the face texture in the form of a vector $I_j$ from 
a set of images of a face illuminated from a different direction in each image j. 
We can then compute the fraction 

$$H(I) = \frac{I_r}{I_l} \qquad (4)$$

for each element of these textures $I_j$. By performing a singular value 
decomposition on the texture fractions $H_j$ we can generate a small set of 10 basis 
vectors b to form an illumination basis, so that every fraction $H_j$ can be expressed 
as a linear combination of the columns of B 



$$B = [\, b_1 \mid b_2 \mid \dots \mid b_{10} \,] \qquad (5)$$

$$H_j = B\,w \qquad (6)$$

where w is the vector of linear coefficients. Fig. 2 illustrates the illumination 
basis. Note that the fraction $H = I_r / I_l$ is a value that is independent of the 
person's individual reflectance parameters (albedo). Therefore we do not need to know 
the person's albedo; in that respect this new approach differs from [12] and [10]. 
Using this measure, it is therefore possible to get an albedo independent model of 
the illumination influences on a face. In contrast to [14], where a normally illuminated 
face of the current person is subtracted from each illumination texture in order 
to build a user independent illumination basis, we strictly model illumination 
with the Lambertian illumination model without the need to assume an additive 
nature of illumination influences. 10 basis vectors seem to suffice to account for 
the non-Lambertian influences and self shadowing. 
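
A minimal NumPy sketch of how such a basis could be built is given below. It is not the authors' implementation: the function names are hypothetical, the texture is assumed to be a 2D array whose vertical centre line is the mirror axis, and which image half is called "right" is an arbitrary choice here. The small `eps` guard against division by zero is also an assumption.

```python
import numpy as np


def texture_quotient(texture, eps=1e-6):
    """Divide the mirrored right half of a face texture by its left half."""
    h, w = texture.shape
    half = w // 2
    left = texture[:, :half]
    right_flipped = texture[:, w - half:][:, ::-1]   # mirror about the centre line
    return (right_flipped / np.maximum(left, eps)).ravel()


def illumination_basis(textures, n_basis=10):
    """Leading left singular vectors of the stacked quotients (one column per image)."""
    H = np.stack([texture_quotient(t) for t in textures], axis=1)
    U, _, _ = np.linalg.svd(H, full_matrices=False)
    return U[:, :n_basis]
```

For a perfectly symmetric albedo the quotient depends only on the shading term, so two faces with different albedo under the same illumination give the same quotient vector.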



3.5 Head Pose Refinement 

In order to further refine the pose estimation we can now use the user 
independent illumination model and formulate the pose refinement as a least squares 
problem in the following way. Let $X_i \in \mathbb{R}^3$ be a point on the head in 3D 
space with respect to a head centered model coordinate system. $x_i \in \mathbb{R}^2$ is 
the corresponding 2D point in the image coordinate system. 

$X_i$ is projected onto $x_i$ under the current head pose $\mu$, which consists of the 
orientation $[\mu_1; \mu_2; \mu_3]$ and the 3D position $[\mu_4; \mu_5; \mu_6]$ of the head 
with respect to the camera. 



$$X(X, \mu) = R(\mu_1, \mu_2, \mu_3)\,X + t(\mu_4, \mu_5, \mu_6) \qquad (7)$$



The similarity transform in (7) aligns the model coordinate system with respect 
to the camera, where R is a rotation about the angles $\mu_1, \mu_2, \mu_3$ and t is the 
translation $\mu_4, \mu_5, \mu_6$ that translates the origin of the model coordinate system 
to the origin of the camera coordinate system. The collinearity equation 



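$$x(X) = \begin{bmatrix} \dfrac{k_1^T X}{k_3^T X} \\[2mm] \dfrac{k_2^T X}{k_3^T X} \end{bmatrix}; \qquad K = \begin{bmatrix} k_1^T \\ k_2^T \\ k_3^T \end{bmatrix} \qquad (8)$$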



formulates the central projection of the aligned model coordinates $X_i$ into the 
image coordinates $x_i$, where K is a matrix containing the intrinsic camera 
calibration parameters. The gray value intensities I of the image can be formulated 
as a function of x (9). H can therefore be expressed as a function of $\mu$ (10). 



$$I = I(x) \qquad (9)$$

$$H(\mu) = H(I(x(X(\mu)))) \qquad (10)$$
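
The chain from model point to image point in equations (7) and (8) can be sketched in NumPy as below. The rotation convention (three successive axis rotations composed as Rz · Ry · Rx, angles in radians) is an assumption made for this illustration, since the text does not specify it; the function names are hypothetical.

```python
import numpy as np


def rotation(m1, m2, m3):
    """Rotation matrix from three angles in radians (assumed order: Rz @ Ry @ Rx)."""
    cx, sx = np.cos(m1), np.sin(m1)
    cy, sy = np.cos(m2), np.sin(m2)
    cz, sz = np.cos(m3), np.sin(m3)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx


def project(X, mu, K):
    """Map model points X (N x 3) to image points (N x 2) under pose mu (6 values)."""
    # Equation (7): rotate and translate into the camera frame.
    X_cam = X @ rotation(*mu[:3]).T + np.asarray(mu[3:6], dtype=float)
    # Equation (8): central projection with the rows k1, k2, k3 of K.
    k1, k2, k3 = np.asarray(K, dtype=float)
    depth = X_cam @ k3
    return np.stack([X_cam @ k1 / depth, X_cam @ k2 / depth], axis=1)
```

Sampling the image at the projected positions then gives the intensities of equation (9), from which the quotient H of equation (10) is formed.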



We can now formulate the objective function O that needs to be minimized. We 
want to find a head pose $\mu$ that can be confirmed by the illumination basis B 
as well as by the roughly detected feature positions $x_n$ of the eyes and the nose 
in the image in a least squares sense. We therefore set 

$$O(\mu, w) = \sum_n P_n \,(x_n(\mu) - x_n)^2 + \sum_i P_i \,(H_i(\mu) - B_i(w))^2 \qquad (11)$$

$P_i$ and $P_n$ are weights associated with every element i of the illumination basis 
and every feature point n of the detected feature positions. With these weights 
we can control how much influence the feature points have with respect to the 
illumination basis in the least squares adjustment. To weigh the feature 
points and the illumination basis equally, and since we have 3 detected feature 
points, we usually set $P_i = 1$ and $P_n = \frac{1}{3} \sum_i P_i$. 
Since O is nonlinear, we need to expand the function into a Taylor series in 
order to be able to iteratively solve the least squares adjustment in the 
fashion of a Gauss-Newton optimization. As a starting point we can use the pose 
estimate $\mu_0$ from the feature based head pose estimation. 

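By setting 

$$A = \begin{bmatrix} \nabla H & -\nabla B \\ \nabla x & 0 \end{bmatrix}; \quad \delta l = \begin{bmatrix} H(\mu_0) - B(w_0) \\ x(\mu_0) - x \end{bmatrix}; \quad P = \begin{bmatrix} \mathrm{diag}(P_i) & 0 \\ 0 & \mathrm{diag}(P_n) \end{bmatrix} \qquad (12)$$

where A is a matrix that includes the Jacobians, $\delta l$ is a vector and P is the 
diagonal matrix with the associated weights, we obtain 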



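$$O(\delta\mu, \delta w) = (A\,[\delta\mu, \delta w]^T + \delta l)^T \, P \, (A\,[\delta\mu, \delta w]^T + \delta l) \qquad (13)$$

$$[\delta\mu, \delta w]^T = -(A^T P A)^{-1} A^T P \,\delta l \qquad (14)$$

$$\mu_0 = \delta\mu + \mu_0; \quad w_0 = \delta w + w_0 \qquad (15)$$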



Equation (13) formulates the linearized objective function in matrix notation. 
Solving the set of equations $\nabla O = 0$ gives the solution in (14). Equation (15) 
yields the update of the head pose $\mu$ and the illumination basis coefficients w 
for the next iteration step. Usually 10 iterations suffice for the adjustment to 
converge. After each iteration step a visibility analysis is performed to determine 
for which points on the face both symmetrically corresponding points on the left 
half and on the right half are visible under the current pose $\mu$. If either of the two 
symmetrically corresponding points is not visible, the pair of points is excluded 
for the next iteration step. This way we can handle self occlusion. 
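
The weighted Gauss-Newton update of equations (14) and (15) can be written generically as in the following sketch. It is a generic illustration rather than the authors' code: `residual` and `jacobian` are hypothetical callbacks that would stack the illumination and feature terms, and the visibility analysis is omitted.

```python
import numpy as np


def gauss_newton_step(A, dl, weights):
    """Solve (A^T P A) d = -A^T P dl for the parameter update d, cf. equation (14)."""
    P = np.diag(weights)
    return -np.linalg.solve(A.T @ P @ A, A.T @ P @ dl)


def gauss_newton(residual, jacobian, theta0, weights, n_iter=10):
    """Iterate the update of equation (15) for a fixed number of steps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta = theta + gauss_newton_step(jacobian(theta), residual(theta), weights)
    return theta
```

Using `np.linalg.solve` on the normal equations avoids forming the explicit inverse that appears in (14); for a residual that is linear in the parameters a single step already gives the weighted least squares solution.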



3.6 Face Texture Verification 

In order to evaluate the detected pose of the head and to discard gross orientation 
errors it is crucial to verify the detected pose. In our approach we use the face 
texture which was extracted from the image under the current head pose as a 
verification hint. By calculating the distance D from an eigenface basis of face 
textures [9], we can evaluate the current face texture. At this stage we assume 
that a correct head pose yields a face texture with a very low distance from 
the eigenface basis, hence these two measures are correlated. The eigenface basis 
was constructed from a database with 3000 high resolution images of 20 different 
people under varying lighting conditions and facial expressions. 

After the head pose refinement the distance D of the current face texture from 
the eigenface basis can be used to classify the estimated head pose as correct or 
as incorrect. A key feature of the system design is therefore the threshold $T_D$ for 
the classification h(D) 

$$h(D) = \begin{cases} \text{correct} & \text{if } D \le T_D \\ \text{incorrect} & \text{otherwise} \end{cases} \qquad (16)$$

If we set $T_D$ to a very high value, we will get a relatively large number of false 
positives. If we set $T_D$ to a very low value we will get a relatively small number 
of false positives but the fraction of false negatives will increase, which leads to a 
lower detection rate. Since the application we have in mind is to initialize a head 
pose tracking system, we are not explicitly interested in a very high detection 
rate. If the initialization fails in one frame, it is possible to simply try again in 
the next frame. The overall detection rate will significantly increase if several 
independent frames are evaluated instead of only one. 
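
The verification step can be illustrated with the following sketch. The text only states that a distance D from an eigenface basis [9] is thresholded; taking D as the norm of the reconstruction residual, and accepting when D does not exceed the threshold, are assumptions of this example, as are the function names.

```python
import numpy as np


def distance_from_basis(texture, mean, basis):
    """Norm of the part of (texture - mean) that lies outside span(basis).

    basis: matrix whose orthonormal columns are the eigenfaces (assumption).
    """
    centred = texture - mean
    reconstruction = basis @ (basis.T @ centred)
    return float(np.linalg.norm(centred - reconstruction))


def pose_is_accepted(distance, threshold):
    """Accept the estimated pose when the distance does not exceed the threshold."""
    return distance <= threshold
```

Raising the threshold accepts more textures, including poor ones, which is the trade-off between false positives and detection rate discussed above.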

4 Experiments and Results 

So far the system only works offline as a prototype in a Matlab implementation. 
We are confident, though, that real time performance can be achieved in a C 
implementation. The most time consuming part is the least squares head pose 
refinement, since it is a nonlinear adjustment with an iterative solution. Similar 
implementations of least squares approaches have achieved real time performance 
before, though [6]. 

To test our system in real world conditions we recorded several sequences 
of 12 different people in low resolution, Fig. 3. The head usually covers 60x40 
pixels. These images include all possible variations of face appearance: arbitrary 
poses, facial expressions, different illumination conditions, and partial occlusions 
of the face are sampled by these images. In order to generate a ground truth, 
we manually determined the head pose in j = 1500 of these images by labeling 
the center of the eyes, the tip of the nose and the corners of the mouth and 
calculating the six parameters $[\mu_1; \mu_2; \mu_3; \mu_4; \mu_5; \mu_6]$ of the ground truth 
pose $\mu$ from that. The mean errors of this ground truth are given in Table 1. Table 1 
also lists the mean errors of the rough pose estimate, the refined pose estimate and 
the refined pose estimate with a texture verification threshold set to $T_D = 11$. The 
mean errors decrease with each step of the system. It is also worth mentioning 
that the mean errors of our system with respect to the ground truth correspond 
to the accuracies of the ground truth itself (Table 1). In other words, the real 
accuracies of our system might even be better. Fig. 3 shows the mean errors of 
the rotational and the translational pose parameters. The diagrams indicate a 
decreasing accuracy of the pose if the threshold $T_D$ is set to high values in order 
to increase the detection rate and therefore the robustness. We can increase the 
robustness by performing the procedure on several subsequent frames and only 
taking into account the frame which received the lowest value of the distance D 
from the eigenface basis. This increases the detection rate and decreases the 
false positive rate. Fig. 3 shows a ROC diagram for setups with 1 frame, 2 frames 
and 6 frames. Fig. 3 also shows 3 samples of the test results. 

5 Conclusion 

We introduced a system to estimate the head pose in 6 degrees of freedom in low 
resolution images. The system is designed to automatically initialize a 3D head 
pose tracking system, e.g. as in [6]. The system is independent of illumination 
influences and requires no personalization training or user interaction. Since the 

[Fig. 3 plots: mean error of the rotational pose parameters [°]; mean error of the 
translational pose parameters [mm]] 






Fig. 3. Top left and right: Mean errors of the parameters of the pose $\mu$ with respect 
to the threshold $T_D$ for setups with 1 frame, 2 frames and 6 frames. As the diagrams 
indicate, the accuracy of the system cannot be improved by taking more frames into 
account. Bottom left: ROC diagram for setups with 1 frame, 2 frames and 6 frames. 2 
discrete values for the threshold $T_D$ of the face texture classification are plotted as a 
white dot for the value $T_D = 11$ and as a black dot for $T_D = 12$. The best results were 
achieved for a setup with 6 frames and a threshold $T_D = 11$. With fewer frames taken 
into account, the results gently decrease in quality. Bottom right: 3 samples of the 
test results. The results of the face detection, the pose refinement and the face texture 
verification are displayed. 



Table 1. Mean Errors 

                                             $\mu_1$   $\mu_2$   $\mu_3$   $\mu_4$   $\mu_5$   $\mu_6$
Ground truth                                 6 deg.    6 deg.    3 deg.    27 mm     26 mm     25 mm
Rough Head Pose Estimate                     10 deg.   13 deg.   5 deg.    38 mm     41 mm     40 mm
Refined Head Pose                            8 deg.    7 deg.    3 deg.    30 mm     28 mm     29 mm
Refined with Texture verification $T_D$ = 11   6 deg.    6 deg.    3 deg.    28 mm     22 mm     24 mm
system is based on global face symmetry, only head poses in which both eyes are 
visible will be detected. In our experiments on low resolution images, we achieved 
a detection rate of 90% at a false positive rate of 3% if 6 subsequent frames are 
taken into account. Considering only one single frame we achieved a detection 
rate of 70% at a false positive rate of 6%. Our experiments indicated mean 
orientation errors of $m_{\mu_1} = m_{\mu_2} = 6$ degrees and of $m_{\mu_3} = 3$ degrees respectively. 
The mean positioning errors are about 25 mm in each dimension. This at least 
matches the accuracy of manual head pose estimation in low resolution images. 
Abstract. We present a new approximation scheme for support vector 
decision functions in object detection. In the present approach we are 
building on an existing algorithm where the set of support vectors is 
replaced by a smaller so-called reduced set of synthetic points. Instead 
of finding the reduced set via unconstrained optimization, we impose 
a structural constraint on the synthetic vectors such that the resulting 
approximation can be evaluated via separable filters. Applications that 
require scanning an entire image can benefit from this representation: 
when using separable filters, the average computational complexity for 
evaluating a reduced set vector on a test patch of size h x w drops from 
$O(h \cdot w)$ to $O(h + w)$. We show experimental results on handwritten digits 
and face detection. 



1 Introduction 

It has been shown that support vector machines (SVMs) provide state-of-the-art 
accuracies in object detection. In time-critical applications, however, they are of 
limited use due to their computationally expensive decision functions. 

In SVMs the time complexity of a classification operation is characterized 
by the following two parameters. First, it is linear in the number of support 
vectors (SVs). Unfortunately, it is known that for noisy problems, the number 
of SVs can be rather large, essentially scaling linearly with the training set size 
[10]. Second, it scales with the number of operations needed for computing the 
similarity (or kernel function) between an SV and the input. When classifying 
h x w patches using plain gray value features, the decision function requires an 
$(h \cdot w)$-dimensional dot product for each SV. As the patch size increases, these 
computations become extremely expensive: the evaluation of a single 20 x 20 
pattern on a 320 x 240 image at 25 frames per second already requires 660 
million operations per second. For such systems to run in (or at least near) 
real-time, it is therefore necessary to lower the computational cost of the SV 
evaluations as well. 
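
The order of magnitude quoted above can be checked with a few lines of Python. The counting convention is an assumption: one multiply-accumulate per patch pixel, and only patch positions that lie fully inside the image. Under these assumptions the count comes out close to the figure given in the text.

```python
def dot_product_ops_per_second(img_w, img_h, patch_w, patch_h, fps):
    """Multiply-accumulate operations per second for one dense patch scan."""
    # Number of positions at which the patch fits entirely inside the image.
    positions = (img_w - patch_w + 1) * (img_h - patch_h + 1)
    return positions * patch_w * patch_h * fps


ops = dot_product_ops_per_second(320, 240, 20, 20, 25)
print(ops)  # 665210000
```

The count is for a single pattern; it grows linearly with the number of support vectors or reduced set vectors that have to be evaluated.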

In the past, however, research towards speeding up kernel expansions has 
focused exclusively on the first issue, i.e., the number of expansion vectors. It has 
been pointed out that one can improve evaluation speed by using approximations 
with smaller expansion sets. In [2] Burges introduced a method that, for a given 
SVM, creates a set of so-called reduced set vectors (RSVs) which approximate 
the decision function. In the image classification domain, speedups of the order 
of 10 to 30 have been reported [2,3,5] while the full accuracy was retained. 

In contrast, this work focuses on the second issue. To this end, we borrow 
an idea from image processing to compute fast approximations to SVM decision 
functions: by constraining the RSVs to be separable, they can be evaluated 
via separable convolutions. This works for most standard kernels (e.g. linear, 
polynomial, Gaussian and sigmoid) and decreases the computational complexity 
of the RSV evaluations from $O(h \cdot w)$ to $O(h + w)$. One of the primary target 
applications for this approach is face detection, an area that has seen significant 
progress of machine learning based systems over the last years [7,11,4,6,12,8]. 

2 Unconstrained Reduced Set Construction 

The current section describes the reduced set method [2] on which our work 
is based. To simplify the notation in the following sections, image patches are 
written as matrices (denoted by capital letters). 

Assume that an SVM has been successfully trained on the problem at hand. 
Let $\{X_1, \dots, X_m\}$ denote the set of SVs, $\{\alpha_1, \dots, \alpha_m\}$ the corresponding 
coefficients, $k(\cdot, \cdot)$ the kernel function and b the bias of the SVM solution. The 
decision rule for a test pattern X reads 



$$f(X) = \mathrm{sgn}\left( \sum_{i=1}^{m} y_i \alpha_i \, k(X_i, X) + b \right) \qquad (1)$$



A central property of SVMs is that the decision surface induced by f corresponds 
to a hyperplane in the reproducing kernel Hilbert space (RKHS) associated with 
k [9]. The normal is given by 



$$\Psi = \sum_{i=1}^{m} y_i \alpha_i \, k(X_i, \cdot). \qquad (2)$$

As the computational complexity of f scales with the number of SVs m, we can 
speed up its evaluation using a smaller reduced set (RS) $\{Z_1, \dots, Z_{m'}\}$ of size 
$m' < m$, i.e. an approximation to $\Psi$ of the form 



$$\Psi' = \sum_{i=1}^{m'} \beta_i \, k(Z_i, \cdot). \qquad (3)$$

To find such $\Psi'$, i.e. the $Z_i$ and their corresponding expansion coefficients 
$\beta_i$, we fix a desired set size $m'$ and solve 

$$\min \|\Psi - \Psi'\|^2_{\mathrm{RKHS}} \qquad (4)$$

for $\beta_i$ and $Z_i$. Here, $\|\cdot\|_{\mathrm{RKHS}}$ denotes the Euclidean norm in the RKHS. The 
resulting RS decision function $f'$ is then given by 

$$f'(X) = \mathrm{sgn}\left( \sum_{i=1}^{m'} \beta_i \, k(Z_i, X) + b \right). \qquad (5)$$

In practice, $\beta_i$, $Z_i$ are found using a gradient based optimization technique. 
Details are given in [2]. 
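
Because both expansions live in the RKHS, the objective (4) can be evaluated with kernel evaluations only, by expanding the squared norm into three Gram matrix terms. The sketch below illustrates this for patches flattened into row vectors. The Gaussian kernel parametrised by a width `sigma` and the function names are assumptions of this example; `ya` stands for the products $y_i \alpha_i$.

```python
import numpy as np


def gaussian_gram(A, B, sigma):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for the rows of A and B."""
    sq = (A * A).sum(1)[:, None] - 2.0 * A @ B.T + (B * B).sum(1)[None, :]
    return np.exp(-sq / (2.0 * sigma ** 2))


def rkhs_sq_distance(X, ya, Z, beta, sigma):
    """||Psi - Psi'||^2 with Psi = sum_i ya_i k(X_i, .), Psi' = sum_i beta_i k(Z_i, .)."""
    return float(ya @ gaussian_gram(X, X, sigma) @ ya
                 - 2.0 * ya @ gaussian_gram(X, Z, sigma) @ beta
                 + beta @ gaussian_gram(Z, Z, sigma) @ beta)
```

The first term does not depend on the reduced set, so an optimizer only needs to re-evaluate the last two terms when the $Z_i$ or $\beta_i$ change.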



3 Constrained Reduced Set Construction 

We now describe the concept of separable filters in image processing and show 
how this idea can be applied to a special class of nonlinear filters, namely those 
used by SVMs during classification. 

3.1 Linear Separable Filters 

Applying a linear filter to an image amounts to a two-dimensional convolution of 
the image with the impulse response of the filter. In particular, if I is the input 
image, H the impulse response, i.e. the filter mask, and J the output image, 
then 



J=I*H. (6) 

If H has size h x w, the convolution requires $O(h \cdot w)$ operations for each output 
pixel. However, in special cases where H can be decomposed into two column 
vectors a and b, such that 

$$H = a b^T \qquad (7)$$

holds, we can rewrite (6) as 

$$J = (I * a) * b^T, \qquad (8)$$

since here, $a b^T = a * b^T$, and since the convolution is associative. This splits 
the original problem (6) into two convolution operations with masks of size h x 1 
and 1 x w, respectively. As a result, if a linear filter is separable in the sense 
of equation (7), the computational complexity of the filtering operation can be 
reduced from $O(w \cdot h)$ to $O(w + h)$ per pixel by computing (8) instead of (6). 
Note that for this to hold, the size of the image I is assumed to be considerably 
larger than h and w. 
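
The equivalence of the direct and the separable evaluation can be verified numerically with a short NumPy sketch. The function names are hypothetical and the direct version is written for clarity rather than speed; both return the full-size convolution result.

```python
import numpy as np


def conv2d_full(image, mask):
    """Direct 2D convolution (full output size), O(h * w) work per output pixel."""
    ih, iw = image.shape
    mh, mw = mask.shape
    out = np.zeros((ih + mh - 1, iw + mw - 1))
    for r in range(mh):
        for c in range(mw):
            # Each mask entry adds a shifted, scaled copy of the image.
            out[r:r + ih, c:c + iw] += mask[r, c] * image
    return out


def conv2d_separable(image, a, b):
    """Same result for mask = outer(a, b), using two 1D convolution passes."""
    cols = np.apply_along_axis(np.convolve, 0, image, a)   # columns with a
    return np.apply_along_axis(np.convolve, 1, cols, b)    # rows with b
```

For a mask built as `np.outer(a, b)` the two functions agree up to rounding, while the separable version touches only h + w mask entries per output pixel.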



3.2 Nonlinear Separable Filters 

Due to the fact that in 2D, correlation is identical with convolution if the filter 
mask is rotated by 180 degrees (and vice versa), we can apply the above idea to 
any image filter $f(X) = g(c(H, X))$ where g is an arbitrary nonlinear function 
and c(H, X) denotes the correlation between image patches X and H (both of 
size h x w). In SVMs this amounts to using a kernel of the form 

$$k(H, X) = g(c(H, X)). \qquad (9)$$

If H is separable, we may split the kernel evaluation into two 1D correlations 
plus a scalar nonlinearity. As a result, if the RSVs in a kernel expansion such as 
(5) satisfy this constraint, the average computational complexity decreases from 
$O(m' \cdot h \cdot w)$ to $O(m' \cdot (h + w))$ per output pixel. This concept works for many 
off-the-shelf kernels used in SVMs. While linear, polynomial and sigmoid kernels 
are defined as functions of input space dot products and therefore immediately 
satisfy equation (9), the idea applies to kernels based on the Euclidean distance 
as well. For instance, the Gaussian kernel reads 

$$k(H, X) = \exp(\gamma\,(c(X, X) - 2c(H, X) + c(H, H))). \qquad (10)$$

Here, the middle term is the correlation which we are going to evaluate via 
separable filters. The first term is independent of the SVs. It can be efficiently 
pre-computed and stored in a separate image. The last term is merely a constant 
scalar independent of the image data. Once these quantities are known, their 
contribution to the computational complexity of the decision function becomes 
negligible. 
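
For a single patch, the three terms of equation (10) can be sketched as follows for a separable H. This sketch assumes that H is given in the factored form s · u vᵀ with unit-length u and v, and that `gamma` is negative for a Gaussian kernel (for example -1/(2σ²)); neither the names nor this parametrisation are taken from the text.

```python
import numpy as np


def correlation(H, X):
    """Correlation of two equally sized patches: sum of elementwise products."""
    return float(np.sum(H * X))


def gaussian_kernel_separable(u, s, v, X, gamma):
    """Evaluate equation (10) for H = s * outer(u, v) with unit-length u and v."""
    c_xx = correlation(X, X)      # independent of H, can be precomputed per position
    c_hx = s * float(u @ X @ v)   # two 1D passes instead of an (h * w) dot product
    c_hh = s * s                  # squared Frobenius norm of s * u v^T
    return float(np.exp(gamma * (c_xx - 2.0 * c_hx + c_hh)))
```

The bracketed sum equals the squared Euclidean distance between H and X, so the result matches a direct evaluation on the full patches.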

3.3 The Proposed Method 

In order to compute such separable SVM approximations, we use a constrained 
version of Burges' method. The idea is to restrict the RSV search space to the 
manifold spanned by all separable image patches, i.e. the one induced by equation 
(7). To this end, we replace the $Z_i$ in equation (3) with $u_i s_i v_i^T$. This yields 

$$\Psi' = \sum_{i=1}^{m'} \beta_i \, k(u_i s_i v_i^T, \cdot) \qquad (11)$$

where, for h x w patches, $u_i$ and $v_i$ are h x 1 and w x 1 vectors of unit length, 
while the scale of the RSV $u_i s_i v_i^T$ is encoded in the scalar $s_i$. Analogously to 
the unconstrained case (4), we solve 

$$\arg\min_{\beta, u, s, v} \|\Psi - \Psi'\|^2_{\mathrm{RKHS}} \qquad (12)$$

via gradient descent. Note that during optimization, the unit length of $u_i$ and 
$v_i$ needs to be preserved. Instead of normalizing $u_i$ and $v_i$ after every step, we 
use an optimization technique for orthogonal matrices, where the $u_i$ and $v_i$ are 
updated using rotations rather than linear steps [1]. This allows us to perform 
relatively large steps, while $u_i$ and $v_i$ stay on the so-called Stiefel manifold, which 
in our case is simply the unit sphere in $\mathbb{R}^h$ and $\mathbb{R}^w$, respectively. The derivation 
of the rotation matrix is somewhat technical. For detailed information about 
gradient descent on Stiefel manifolds, see [1]. 
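
To give an impression of what a rotation-based update on the unit sphere looks like, the sketch below moves a unit vector along a great circle in the direction of steepest descent. This is a generic geodesic step chosen for illustration and is not claimed to be the scheme of [1]; the function name and the step size convention are assumptions.

```python
import numpy as np


def sphere_step(u, grad, step):
    """Move unit vector u against grad along a great circle of the unit sphere."""
    tangent = grad - (grad @ u) * u          # remove the radial component
    norm = np.linalg.norm(tangent)
    if norm < 1e-12:                         # gradient is purely radial: stay put
        return u
    direction = -tangent / norm              # unit descent direction, orthogonal to u
    angle = step * norm                      # rotation angle in radians
    return np.cos(angle) * u + np.sin(angle) * direction
```

Because the update is a rotation in the plane spanned by `u` and `direction`, the result has unit length without any explicit renormalization.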
4 Experiments 

We have conducted two experiments: the first one shows the convergence of our 
approximations on the USPS database of handwritten digits [9]. Note that since 
this is usually considered a recognition task rather than a detection problem 
in the sense that we classify single patches as opposed to every patch within a 
larger image, this experiment can only illustrate effects on classification accu- 
racy, not on speed. In contrast, the second part of this section describes how 
to speed up a cascade-based face detection system using the proposed method. 
Here, we illustrate the speedup effect which is achieved by using separable RSV 
approximations during early evaluation stages of the cascade. 

4.1 Handwritten Digit Recognition 

The USPS database contains gray level images of handwritten digits, 7291 for 
training and 2007 for testing. The patch size is 16 x 16. In this experiment we 
trained hard margin SVMs on three two-class problems, namely "0 vs. rest", "1 
vs. rest" and "2 vs. rest", using a Gaussian kernel with $\sigma = 15$ (chosen according 
to [9], chapter 7). The resulting classifiers have 281, 80 and 454 SVs, respectively. 
Classification accuracies are measured via the area under the ROC curve (AUC), 
where the ROC curve plots the detection rate against the false positive rate for 
varying decision thresholds. Hence, an AUC equal to one amounts to perfect 
prediction, whereas an AUC of 0.5 is equivalent to random guessing. 
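
For readers who want to reproduce the accuracy measure, the following is a minimal sketch of one standard way to compute the AUC, namely as the fraction of (positive, negative) pairs that the classifier ranks correctly. The function name `auc` and the label convention (+1 / -1) are assumptions made for this illustration; the source does not say how the AUC was computed.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) statistic."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    # Correctly ordered pairs count 1, ties count one half.
    correct = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return correct / (len(pos) * len(neg))
```

A classifier that scores every positive above every negative yields 1, and one that assigns the same score to everything yields 0.5, matching the two reference values given in the text.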

Figure 1 shows the AUC of our approximations for RS sizes up to $m' = 32$.
It further plots the performance of the unconstrained RS approximations as well
as the full SVM classifier. We found that both unconstrained and constrained
approximations converge to the full solution as $m'$ grows. As expected, we need
a larger number of separable RSVs than unconstrained RSVs to obtain the 
same classification accuracy. However, the next experiment will show that in a 
detection setting the accuracy is actually increased as soon as the number of 
required computations is taken into account. 

4.2 Face Detection 

We now give an example of how to speed up a cascade-based face detection
system using our method. The cascaded evaluation [6,12] of classifiers has become 
a popular technique for building fast object detection systems. For instance, 
Romdhani et al. presented an algorithm that on average uses only 2.8 RSV 
evaluations per scanned image position. The advantage of such systems stems 
from the fact that during early evaluation stages, fast detectors discard a large 
number of the false positives [6,12]. Hence, the overall computation time strongly 
depends on how much ’work’ is done by these first classifiers. This suggests 
replacing the first stages with a separable RSV approximation that classifies 
equally well. 

Fig. 1. Left column, top to bottom: the AUC (area under ROC curve) for the USPS
classifiers "0 vs. rest", "1 vs. rest" and "2 vs. rest", respectively. For the approximations
(dashed and solid lines), the size parameter $m'$ was varied between 1 and 32. The right
column shows subsets of the corresponding expansion vectors: in each figure, the top
row illustrates five (randomly selected) SVs used in the full SVM, while the middle
and bottom rows show five of the unconstrained and separable RSVs, respectively.

The full SVM was trained using our own face detection database. It consists
of 19 x 19 gray value images, normalized to zero mean and unit variance. The
training set contains 11204 faces and 22924 non-faces, the test set contains 1620
faces and 4541 non-faces. We used a Gaussian kernel with $\sigma = 10$; the
regularization constant was set to C = 1. This yielded a classifier with 7190 SVs.
Again, we computed RSV approximations up to size $m' = 32$, both separable
and unconstrained.
Fig. 2. Left: accuracies of the unconstrained and constrained approximations in face
detection. As before, the dotted line shows the accuracy of the full SVM, whereas
the dashed and solid lines correspond to unconstrained and separable RSV classifiers,
respectively. Right: additionally, we show a subset of the SVs of the full SVM (top row)
plus five unconstrained and constrained RSVs (middle and bottom row, respectively).

The results are depicted in Figure 2. Note that for 19 x 19 patches, scanning 
an image with a separable RSV reduces the number of required operations to less 
than 11%, compared to the evaluation of an unconstrained RSV. This suggests 
that for our cascade to achieve the accuracy of the unconstrained $m' = 1$ classifier
after the first stage, we may for instance plug in the separable $m' = 2$ version,
which requires roughly 22% of the previous operations and yet classifies better
(the AUC improves from 0.83 to 0.87). Alternatively, replacing the first stage
with the separable $m' = 8$ classifier results in an AUC of 0.9 instead of 0.83,
while the computational complexity remains the same.
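
The operation counts above follow from the structure of a separable filter: a rank-one h x w kernel can be applied as an h-tap column pass followed by a w-tap row pass, so each image position costs h + w multiplications instead of h * w. For 19 x 19 patches this is 38 versus 361, i.e. about 10.5%; under this simplified count (multiplications only), two separable RSVs cost about 21% and eight about 84% of one unconstrained RSV. The sketch below checks that both evaluation orders give the same result. The function names are hypothetical, and the plain NumPy loops stand in for the optimized library routines one would use in practice.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Direct 2-D correlation over the 'valid' region: h*w products per position."""
    h, w = kernel.shape
    H, W = image.shape
    out = np.empty((H - h + 1, W - w + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + h, c:c + w] * kernel)
    return out

def correlate_separable_valid(image, u, v):
    """Same result for kernel = outer(u, v), using h + w products per position."""
    h, w = len(u), len(v)
    H, W = image.shape
    # Vertical pass: combine h consecutive rows with the weights in u.
    tmp = np.empty((H - h + 1, W))
    for r in range(H - h + 1):
        tmp[r] = u @ image[r:r + h, :]
    # Horizontal pass: combine w consecutive columns with the weights in v.
    out = np.empty((H - h + 1, W - w + 1))
    for c in range(W - w + 1):
        out[:, c] = tmp[:, c:c + w] @ v
    return out

def cost_ratio(h, w):
    """Multiplications per position: separable pass relative to the full kernel."""
    return (h + w) / (h * w)
```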

5 Discussion 

We have presented a reduced set method for SVMs in image processing. As
our constrained RSV approximations can be evaluated as separable filters, they
require far fewer computations than their non-separable counterparts when
applied to complete images. Experiments have shown that for face detection, the
degradation in accuracy caused by the separability constraint is more than
compensated by the computational advantage. The approach is thus justified in
terms of the expected speedup.

Another vital property of our approach is simplicity. By construction, it al- 
lows the use of off-the-shelf image processing libraries for separable convolutions. 
Since such operations are essential in image processing, there exist many — of- 
ten highly optimized — implementations. Moreover, by directly working on gray 
values, separable RSVs can be mixed with unconstrained RSVs or SVs with- 
out affecting the homogeneity of the existing system. As a result, the required 
changes in existing code, such as for [6], are negligible. 

We are currently integrating our method into a complete face detection system.
Future work includes a comprehensive evaluation of the system as well as
its extension to other detection problems such as component-based object
detection and interest operators. Furthermore, since SVMs are known to also yield
good results in regression problems, the proposed method might provide a
convenient tool for speeding up different types of image processing applications that
require real-valued (as opposed to binary) outputs. As a final remark, note that
separable filters can be applied to higher dimensional grid data as well (e.g.
volumes or time sequences of images), providing further possible applications for
our approach.
Abstract. In this paper, we present a novel method for reducing the computational
complexity of a Support Vector Machine (SVM) classifier without significant loss
of accuracy. We apply this algorithm to the problem of face detection in images. To
achieve high run-time efficiency, the complexity of the classifier is made dependent
on the input image patch by use of a Cascaded Reduced Set Vector expansion of
the SVM. The novelty of the algorithm is that the Reduced Set Vectors have a Haar-like
structure enabling a very fast SVM kernel evaluation by use of the Integral
Image. It is shown in the experiments that this novel algorithm provides, for a
comparable accuracy, a 200-fold speed-up over the SVM and a 6-fold speed-up
over the Cascaded Reduced Set Vector Machine.



1 Introduction 

Detecting a specific object in an image is a computationally expensive task, as all the
pixels of the image are potential object centres. Hence all the pixels have to be classified.
This is called the brute force approach and is used by all object detection algorithms.
A common way to increase the detection speed is therefore a cascaded evaluation
of hierarchical filters: pixels that are easy to discriminate are classified by simple and fast
filters, and pixels that resemble the object of interest are classified by more involved and
slower filters. This is achieved by building a cascade of classifiers of increasing complexity. In
the case of face detection, if a pixel is classified as a non-face at any stage of the cascade,
then the pixel is rejected and no further processing is spent on that pixel.

In the area of face detection, this method was independently introduced by Keren
et al. [2], by Romdhani et al. [3] and by Viola and Jones [6]. These algorithms all use a
20 x 20 pixel patch around the pixel to be classified. The main difference between these 
approaches lies in the manner by which the hierarchical filters are obtained, and more 
specifically, the criterion optimised during training. 

The detector from Keren et al. [2] assumes that the negative examples (i.e. the
non-faces) are modeled by a Boltzmann distribution and that they are smooth. This
assumption could increase the number of false positives in the presence of a cluttered background.
Here, we do not make this assumption: the negative example can be any image patch.
Romdhani et al. [3] use a Cascaded Reduced Set Vector expansion of a Support Vector
Machine (SVM) [5]. The advantage of this detector is that it is based on an SVM
classifier that is known to have optimal generalisation capabilities. Additionally, the learning
stage is straightforward, automatic and does not require the manual selection of ad-hoc
parameters. At each stage of the cascade, one optimal 20 x 20 filter is added to the
classifier. A drawback of these two methods is that the computational performance is
not optimal, as at least one convolution of a 20 x 20 filter has to be carried out on the
full image.

Viola & Jones [6] use Haar-like oriented edge filters having a block-like structure
enabling a very fast evaluation by use of an Integral Image. These filters are weak,
in the sense that their discrimination power is low. They are selected, among a finite
set, by the AdaBoost algorithm that yields the ones with the best discrimination. Then
strong classifiers are produced by including several weak filters per stage using a voting
mechanism. A drawback of their approach is that it is difficult to determine how many
weak filters should be included at one stage of the cascade. Adding too many filters
improves the accuracy but deteriorates the run-time performance, whereas too few filters
favour the run-time performance but decrease the accuracy. The number of filters per
stage is usually set so as to reach a manually selected false positive rate. Hence it
is not clear that the cascade achieves optimal performance. In practice, the training
proceeds by trial and error, and often the number of filters per stage must be manually
selected so that the false positive rate decreases smoothly. Additionally, AdaBoost is a
greedy algorithm that selects one filter at a time to minimise the current error. However,
if the training is considered as an optimisation problem over both filters and thresholds,
the greedy algorithm clearly does not result in the global optimum in general. Another
drawback of the method is that the set of available filters is limited and manually selected
(they have a binary block-like structure), and, again, it is not clear that these filters provide
the best discrimination for a given complexity. Additionally, the training of the classifier
is very slow, as every filter (and there are about $10^5$ of them) is evaluated on the whole
set of training examples, and this is done every time a filter is added to a stage of the
cascade.

In this paper, we present a novel face detection algorithm based on, and improving
the run-time performance of, the Cascaded Reduced Set Vector expansion of Romdhani
et al. [3]. Both approaches benefit from the following features: (i) They both leverage
the guaranteed optimal generalisation performance of an SVM classifier. (ii) The SVM
classifier is approximated by a Reduced Set Vector Machine (see Section 2) that provides
a hierarchy of classifiers of increasing complexity. (iii) The training is fast, principled
and automatic, as opposed to the Viola and Jones method. The speed bottleneck of [3] is
that the Reduced Set Vectors (RSVs) are 20 x 20 image patches for which the pixels can
take any value (see Section 2), resulting in a computationally expensive evaluation of
the kernel with an image patch. Here we constrain the RSVs to have a Haar-like block
structure. Then, similarly to Viola & Jones [6], we use the Integral Image to achieve
very high speed-ups. So, this algorithm can be viewed as a combination of the good
properties of the Romdhani et al. detector (guaranteed optimal generalisation, fast and
automatic training, high accuracy) and of the Viola & Jones detector (high efficiency).

In this paper, we choose to start from an optimal detector and improve its run-time 
performance by making its complexity dependent on the input image patch. This is in 
contrast with the Viola & Jones approach that starts from a set of faster weak classi- 
fiers which are selected and combined to increase accuracy. This is a major conceptual 
distinction whose thorough theoretical comparison is still to be made. 
Section 2 of this paper reviews the SVM and its Reduced Set Vector expansion.
Section 3 details our novel training algorithm that constructs a Reduced Set Vector
expansion having a block-like structure. It is shown in Section 4 that the new expansion
yields a comparable accuracy to the SVM while providing a significant speed-up.

2 Nonlinear Support Vector Machines and Reduced Set Expansion 

Support Vector Machines (SVM), used as classifiers, are now well-known for their good 
generalisation capabilities. In this section, we briefly introduce them and outline the 
usage of an approximation of SVMs called Reduced Set Vector Machines (RVM) [4].
RVMs provide a hierarchy of classifiers of increasing complexity. Their use for fast face
detection is demonstrated in [3]. 

Suppose that we have a labeled training set consisting of a series of 20 x 20 image
patches $x_i \in X$ (arranged in a 400 dimensional vector) along with their class labels
$y_i \in \{\pm 1\}$. Support Vector classifiers implicitly map the data $x_i$ into a dot product space $F$ via
a (usually nonlinear) map $\Phi : X \to F$, $x \mapsto \Phi(x)$. Often, $F$ is referred to as the feature
space. Although $F$ can be high-dimensional, it is usually not necessary to explicitly
work in that space [1]. There exists a class of kernels $k(x, x')$ which can be shown to
compute the dot products in associated feature spaces, i.e. $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$. It
is shown in [5] that the training of an SVM classifier provides a classifier with the largest
margin, i.e. with the best generalisation performance for the given training data and
the given kernel. Thus, the classification of an image patch $x$ by an SVM classification
function, with $N_x$ support vectors $x_i$, with non-null coefficients $\alpha_i$ and with a threshold
$b$, is expressed as follows:

$$y = \operatorname{sgn}\left( \sum_{i=1}^{N_x} \alpha_i k(x_i, x) + b \right) \qquad (1)$$



A kernel often used, and used here, is the Gaussian Radial Basis Function Kernel:

$$k(x_i, x) = \exp\left( -\frac{\|x_i - x\|^2}{2\sigma^2} \right) \qquad (2)$$



The Support Vectors (SV) form a subset of the training vectors. The classification
of one patch by an SVM is slow because there are many support vectors. The SVM can
be approximated by a Reduced Set Vector (RVM) expansion [4]. We denote by $\Psi \in F$
the vector normal to the separating hyperplane of the SVM, and by $\Psi'_{N_z} \in F$ the vector
normal to the RVM with $N_z$ vectors:

$$\Psi = \sum_{i=1}^{N_x} \alpha_i \Phi(x_i), \qquad \Psi'_{N_z} = \sum_{i=1}^{N_z} \beta_i \Phi(z_i), \qquad \text{with } N_z \ll N_x \qquad (3)$$



The $z_i$ are the Reduced Set Vectors and are found by minimising $\|\Psi - \Psi'_{N_z}\|^2$ with
respect to $z_i$ and to $\beta_i$. They have the particularity that they can take any values; they
are not limited to be one of the training vectors, as the support vectors are. Hence,
far fewer Reduced Set Vectors are needed to approximate the SVM. For instance, an
SVM with more than 8000 Support Vectors can be accurately approximated by an RVM
with 100 Reduced Set Vectors. The second advantage of RVMs is that they provide a
hierarchy of classifiers. It was shown in [3] that the first Reduced Set Vector is the
one that discriminates the data the most; the second Reduced Set Vector is the
one that discriminates most of the data that were mis-classified by the first Reduced
Set Vector, etc. This hierarchy of classifiers is obtained by first finding $\beta_1$ and $z_1$ that
minimise $\|\Psi - \beta_1 \Phi(z_1)\|^2$. Then the Reduced Set Vector $k$ is obtained by minimising
$\|\Psi_k - \beta_k \Phi(z_k)\|^2$, where $\Psi_k = \Psi - \sum_{i=1}^{k-1} \beta_i \Phi(z_i)$.

Then, Romdhani et al. used in [3] a Cascaded Evaluation based on an early rejection
principle, so that the number of Reduced Set Vectors necessary to classify a patch is, on
average, much less than the number of Reduced Set Vectors, $N_z$. So, the classification
of a patch $x$ by an RVM with $j$ Reduced Set Vectors is:

$$y_j(x) = \operatorname{sgn}\left( \sum_{i=1}^{j} \beta_i k(x, z_i) + b_j \right) \qquad (4)$$

This approach provides a significant speedup over the SVM (by a factor of 30), but is
still not fast enough, as the image has to be convolved with at least one 20 x 20 filter. The
algorithm presented in this paper improves this method because it does not require
this convolution to be performed explicitly. Indeed, it approximates the Reduced Set Vectors
by Haar-like filters and computes the evaluation of a patch using an Integral Image of
the input image. An Integral Image [6] is used to compute the sum of the pixels in a
rectangular area of the input image in constant time, by just four additions. It can be
used to compute very efficiently the dot product of an image patch with an image that
has a block-like structure, i.e. rectangles of constant values.
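
The constant-time rectangle sum can be sketched as follows. This is a minimal illustration of the Integral Image idea, not the implementation used in the experiments; the function names and the convention of padding the table with a leading row and column of zeros are choices made for this example.

```python
import numpy as np

def integral_image(img):
    """Summed-area table: ii[r, c] holds the sum of img[:r, :c]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] from four table entries."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]
```

The table is built once per image; afterwards every rectangle sum costs the same four lookups regardless of the rectangle size, which is what makes block-structured vectors cheap to evaluate.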

3 Reduced Set Vector with a Haar-Like Block Structure 

As explained in Section 2, the speed bottleneck of the Cascaded Reduced Set Vector
classifier is the computation of the kernel of a patch with a Reduced Set Vector (see
Equation (4)). In the case of the Gaussian kernel, which we selected, the computational
load is spent in evaluating the norm of the difference between a patch $x$ and a Reduced
Set Vector $z_k$ (see Equation (2)). The squared norm can be expanded as follows:

$$\|x - z_k\|^2 = x'x - 2x'z_k + z_k'z_k \qquad (5)$$

As $z_k'z_k$ is independent of the input image, it can be pre-computed. The sum of squares of the
pixels of a patch of the input image, $x'x$, is efficiently computed using the Integral Image
of the squared pixel values of the input image. As a result, the computational load of this
expression is determined by the term $x'z_k$. We observe that if the Reduced Set Vector
$z_k$ has a block-like structure, similar to the Viola & Jones filters, then this operation can
be evaluated very efficiently by use of the Integral Image: if $z_k$ is an image patch with
rectangles of constant (and different) grey levels, then the dot product is evaluated by
4 additions per rectangle and one multiplication per grey level value (note that many
rectangles may have the same grey level). Hence we propose to approximate the SVM
by a set of Reduced Set Vectors that cannot take arbitrary values but have a block-like structure,
as seen in Figure 1.
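
The following sketch puts Equation (5) together with the Integral Image to evaluate the Gaussian kernel between an image patch and a block-structured vector. It is an illustration under stated assumptions: the kernel is taken in the form exp(-||x - z||^2 / (2 sigma^2)), the block structure is encoded as a list of `(top, left, height, width, grey)` tuples that tile the patch, and all function names are hypothetical.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a leading row and column of zeros."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum over a rectangle from four table entries."""
    b, r = top + height, left + width
    return ii[b, r] - ii[top, r] - ii[b, left] + ii[top, left]

def block_kernel(ii, ii_sq, row, col, size, rects, sigma):
    """Gaussian kernel between the size x size patch at (row, col) and a block vector.

    ii, ii_sq : integral images of the input image and of its squared pixels
    rects     : (top, left, height, width, grey) tuples tiling the patch
                (hypothetical encoding of the block structure)
    """
    # x'x: one rectangle sum on the squared image.
    xx = rect_sum(ii_sq, row, col, size, size)
    # x'z: one rectangle sum per block, weighted by the block's grey level.
    xz = sum(g * rect_sum(ii, row + t, col + l, h, w) for t, l, h, w, g in rects)
    # z'z: independent of the image, so it could be pre-computed once.
    zz = sum(g * g * h * w for t, l, h, w, g in rects)
    return np.exp(-(xx - 2.0 * xz + zz) / (2.0 * sigma ** 2))
```

The cost per patch depends only on the number of rectangles, not on the patch size, which is why the energy in Section 3 penalises the rectangle count.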
Fig. 1. First Reduced Set Vector of an SVM face classifier and its block-like approximation obtained
by the learning algorithm presented in this section.



The block-like Reduced Set Vectors must (i) be a good approximation of the SVM,
hence minimise $\|\Psi - \Psi'_{N_z}\|$, and (ii) have few rectangles with constant value to provide
a fast evaluation. Hence, to obtain the $k$th Reduced Set Vector, instead of minimising just
$\|\Psi_k - \beta_k \Phi(z_k)\|$ as in [3], we minimise the following energy with respect to $\beta_k$ and to
$z_k$:

$$E_k = \|\Psi_k - \beta_k \Phi(z_k)\|^2 + w(4n + v), \qquad (6)$$

where $n$ is the number of rectangles, $v$ is the number of different grey levels in $z_k$ and $w$
is a weight that trades off the accuracy of the approximation with the run-time efficiency
of the evaluation of $z_k$ with an input patch.

To minimise the energy $E_k$, we use Simulated Annealing, which is a global
optimisation method. The starting value of this optimisation is the result of the minimisation
of $\|\Psi_k - \beta_k \Phi(z_k)\|^2$, i.e. the Reduced Set Vector as computed in [3]. To obtain a block-like
structure the following two operations are performed, as shown in Figure 2:

1. Quantisation: The grey values of $z_k$ are quantised into $v$ bins. The threshold values
of this quantisation are the $100/v$ percentiles of the grey values of $z_k$. For instance if
$v = 2$, then $z_k$ will be approximated by 2 grey levels, and the 50% percentile is
used as a threshold: the pixels of $z_k$ for which the grey values are lower than the
threshold are set to the mean of these pixels, and likewise for the pixels above the
threshold. The result of this quantisation on two
Reduced Set Vectors is shown in the second column of Figure 2.

2. Block structure generation: The quantisation reduces the number of grey level
values used to approximate a Reduced Set Vector $z_k$, but it does not produce a block
structure. To obtain a block structure two types of morphological operations are
used: opening (an erosion followed by a dilation) or closing (a dilation followed
by an erosion). The type of morphological operation applied is denoted by
$M \in \{\text{opening}, \text{closing}\}$, and the size of the structuring elements is denoted by $S$. The
coordinates of the rectangles are obtained by looking for the maximum width and
height of disjoint rectangular areas at the same grey level.
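
The quantisation step (step 1 only) can be sketched as below. The sketch assumes that the thresholds are placed at equally spaced percentiles (100/v, 2*100/v, ...), which agrees with the v = 2 example in the text but is otherwise an assumption; the function name `quantise` is hypothetical, and the morphological step is not covered here.

```python
import numpy as np

def quantise(z, v):
    """Quantise the grey values of z into v bins, replacing each bin by its mean."""
    z = np.asarray(z, dtype=float)
    # Interior thresholds at equally spaced percentiles (assumed spacing).
    thresholds = np.percentile(z, [100.0 * k / v for k in range(1, v)])
    # Bin index in 0..v-1 for every pixel.
    bins = np.digitize(z, thresholds)
    out = np.empty_like(z)
    for b in range(v):
        mask = bins == b
        if mask.any():
            # All pixels of a bin receive the mean grey value of that bin.
            out[mask] = z[mask].mean()
    return out
```

After this step the vector contains at most v distinct grey values, but pixels with equal values may still form irregular regions, which is why the second step is needed.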

Simulated Annealing is used to obtain a minimum of the energy $E_k$ by selecting the
parameters $v$, $M$ and $S$ that minimise $E_k$. As these new Reduced Set Vectors have a
Haar-like structure, we call them Haar-Reduced Set Vectors (H-RSVs), and the resulting
classifier the H-RVM.

Note that the thresholds $b_j$ are chosen to minimise the False Rejection Rate (FRR),
i.e. the number of face patches classified as non-face, using the Receiver Operating
Characteristic (ROC) (computed on the training set), as done in [3].
Fig. 2. Example of the Haar-like approximation of a face and an anti-face like RSV (1st column, "RSV").
2nd column ("Quantised"): vectors discretized to four gray levels; 3rd column ("After Morph. Op."):
vectors smoothed by morphological filters; 4th column ("H-RSV"): H-RSVs with computed rectangles.

3.1 Detection Process - Cascade Evaluation 

Thanks to the Haar-like approximated RVM the kernel is computed very efficiently with
the Integral Image. To classify an image patch, a cascaded evaluation based on an early
rejection rule is used, similarly to [3]: We first approximate the hyperplane by a single
H-RSV $z_1$, using Equation (4). If $y_1$ is negative, then the patch is classified as a
non-face and the evaluation stops. Otherwise the evaluation continues by incorporating the
second H-RSV $z_2$. Then, again, if $y_2$ is negative, the patch is classified as a non-face and
the evaluation stops. We keep on making the classifier more complex by incorporating
more H-RSVs and rejecting as early as possible until a positive evaluation using the last
H-RSV $z_{N_z}$ is reached. Then the full SVM is used with (1).
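
The early-rejection rule described above can be sketched as a short loop. This is an illustration only: it evaluates the kernel directly on vectors rather than through the Integral Image, it assumes that each stage has its own coefficient vector and threshold, and the names `cascade_classify`, `betas`, `thresholds` and `full_svm` are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, z, sigma):
    """Gaussian kernel, assumed here in the form exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def cascade_classify(x, rsvs, betas, thresholds, sigma, full_svm):
    """Classify patch x with early rejection.

    rsvs       : reduced set vectors z_1 .. z_Nz
    betas      : betas[j] holds the j + 1 coefficients used at stage j (assumed layout)
    thresholds : thresholds[j] is the threshold of stage j
    full_svm   : callable returning +1 or -1, applied only to surviving patches
    Returns (label, number of stages evaluated).
    """
    kernel_values = []
    for j, z in enumerate(rsvs):
        # Each stage adds exactly one new kernel evaluation.
        kernel_values.append(gaussian_kernel(x, z, sigma))
        score = np.dot(betas[j], kernel_values) + thresholds[j]
        if score < 0:
            # Rejected as non-face; no further processing for this patch.
            return -1, j + 1
    # Survived every stage: let the full SVM decide.
    return full_svm(x), len(rsvs)
```

Most patches in a typical image are rejected within the first few stages, so the average cost per patch is far below that of evaluating all stages.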

4 Experimental Results 

We used a training set that contains several thousand images downloaded from the
World Wide Web. The training set includes 3500 face patches of size 20 x 20 and 20000
non-face patches; the validation set includes 1000 face patches and 100,000 non-face patches.
The SVM computed on the training set yielded about 8000 support vectors that we
approximated by 90 Haar-like Reduced Set Vectors by the method detailed in the previous
section.

The first plot of Figure 3 shows the evolution of the approximation of the SVM by the
RVM and by the H-RVM (in terms of the distance $\|\Psi - \Psi'\|$) as a function of the number
of vectors used. It can be seen that for a given accuracy more Haar-like Reduced Set
Vectors are needed to approximate the SVM than for the RVM. However, as is seen in
the second plot, for a given computational load, the H-RVM rejects many more non-face
patches than the RVM. This explains the improved run-time performance of the
H-RVM. Additionally, it can be seen that the curve is smoother for the H-RVM, hence
a better trade-off between accuracy and speed can be obtained by the H-RVM. Figure 4
shows an example of face detection in an image using the H-RVM. As the stages in
the cascade increase, fewer and fewer patches are evaluated. At the last H-RVM stage, only 5
pixels have to be classified using the full SVM.

Fig. 3. Left: $\|\Psi - \Psi'\|$ distance as a function of the number of vectors $N_z$ for the RVM (dashed
line) and the H-RVM (solid line). Right: Percentage of rejected non-face patches as a function of
the number of operations required.

Fig. 4. Input image followed by images showing the amount of rejected pixels at the 1st, 3rd and
50th stages of the cascade. The white pixels are rejected and the darkness of a pixel is proportional
to the output of the H-RVM evaluation. The penultimate image shows a box around the pixels
alive at the end of the 90 H-RVM stages, and the last image the result after the full SVM is applied.

Figure 5 shows the ROCs, computed on the validation set, of the SVM, the RVM
(with 90 Reduced Set Vectors) and the H-RVM (with 90 Haar-like Reduced Set Vectors).
It can be seen that the accuracies of the three classifiers are similar without (left plot) and
almost equal with (right plot) the final SVM classification for the remaining patches.

Table 1 compares the accuracy and the average time required to evaluate the patches 
of the validation set. As can be seen, the novel H-RVM approach provides a significant 
speed-up (200-fold over the SVM and almost 6-fold over the RVM), for no substantial 
loss of accuracy. 



Table 1. Comparison of accuracy and speed improvement of the H-RVM to the RVM and SVM

method    FRR     FAR       time per patch in μs
SVM       1.4%    0.002%    787.34
RVM       1.5%    0.001%    22.51
H-RVM     1.4%    0.001%    3.85
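
As a quick check of the speed-up factors quoted in the text, the ratios of the per-patch times in Table 1 work out as follows (rounded to one decimal place; the symbols $t_{\mathrm{SVM}}$, $t_{\mathrm{RVM}}$ and $t_{\mathrm{H\text{-}RVM}}$ are shorthand introduced here for the three table entries):

```latex
\frac{t_{\mathrm{SVM}}}{t_{\mathrm{H\text{-}RVM}}} = \frac{787.34}{3.85} \approx 204.5, \qquad
\frac{t_{\mathrm{RVM}}}{t_{\mathrm{H\text{-}RVM}}} = \frac{22.51}{3.85} \approx 5.8, \qquad
\frac{t_{\mathrm{SVM}}}{t_{\mathrm{RVM}}} = \frac{787.34}{22.51} \approx 35.0
```

These values are consistent with the "200-fold" and "almost 6-fold" figures stated above, and with the factor of about 30 reported for the RVM over the SVM in Section 2.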
Fig. 5. ROCs of the SVM, the RVM (with 90 Reduced Set Vectors) and the H-RVM (with
90 Haar-like Reduced Set Vectors), (left, "ROC (without SVM)") without and (right, "ROC (with SVM)")
with the final SVM classification for the remaining patches. The FAR is related to non-face patches.



Another source of speed-up in favour of the H-RVM over the SVM and the RVM,
which is not shown in Table 1, is that no image pyramid is required to
detect faces at several scales with the H-RVM. Indeed, thanks to the Integral Image
implementation of the kernel, the classifier can be evaluated at different sizes in constant
time, without having to rescale the input image.



5 Conclusion 

In this paper we presented a novel efficient method for detecting faces in images. In our
approach we separated the problem of finding an optimally classifying hyper-plane for
separating faces from non-faces in image patches from the problem of implementing a
computationally efficient representation of this optimal hyper-plane. This is in contrast
to most methods, where computational efficiency and classification performance are
optimised simultaneously. Having obtained a hyper-plane with an optimal discrimination
power but with a computationally quite expensive SVM classifier, we then concentrated
on a reduction of the computational complexity for representing this hyper-plane. We
developed a cascade of computationally efficient classifiers approximating the optimal
hyper-plane. Computational efficiency is improved by transforming the feature vectors
into block structured Haar-like vectors that can be evaluated extremely efficiently by
exploiting the Integral Image method.
Abstract. An experimental comparison of ‘Edge-Element Association
(EEA)’ and ‘Marginalized Contour (MCo)’ approaches for 3D model- 
based vehicle tracking in traffic scenes is complicated by the different 
shape and motion models with which they have been implemented orig- 
inally. It is shown that the steering-angle motion model originally asso- 
ciated with EEA allows more robust tracking than the angular-velocity 
motion model originally associated with MCo. Details of the shape mod- 
els can also make a difference, depending on the resolution of the images. 
Performance differences due to the choice of motion and shape model 
can outweigh the differences due to the choice of the tracking algorithm. 
Tracking failures of the two approaches, however, usually do not happen 
at the same frames, which can lead to insights into the relative strengths 
and weaknesses of the two approaches. 



1 Introduction 

Detection and tracking of visible objects constitute standard challenges for the 
evaluation of image sequences. If the camera is stationary, then tracking by 
change detection (e.g. [11,7,5]) is feasible in real time, but can generate prob- 
lems when the images of vehicles in a road traffic scene overlap significantly. 
Switching from 2D tracking in the image plane to 3D tracking in the scene do- 
main often results in a substantially reduced failure rate, because more prior 
knowledge about the size and shape of objects to be tracked, their motion, and 
their environment, can be brought to bear. 

Very few 3D model-based trackers have been developed for road vehicles. In
order to understand the strengths and weaknesses of these trackers, they should
be compared under a variety of driving conditions, camera geometry, illumination,
and occlusion conditions. Such comparisons are complicated by the fact
that differences in performance can be due to differences between initialisation
methods, shape models, motion (dynamical) models, or pose-refinement methods.
Pose-refinement methods can be further decomposed into the criterion used
to evaluate the vehicle’s pose and the method used to optimise the evaluation
criterion. Differences between pose-refinement methods are perhaps of greater
scientific interest, yet differences between shape and/or motion models can be
equally important for robust tracking. In the following, the term “approach” will
be used as shorthand for pose-refinement method.

H. Dahlkamp et al., in: C.E. Rasmussen et al. (Eds.): DAGM 2004, LNCS 3175, pp. 71-78, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

Here, two approaches to vehicle tracking with 3D shape and motion models 
are analysed: an approach based on optimising a marginalised likelihood ratio [8, 
9] and an approach based on edge-element association [3]. A first comparison of 
these approaches has been reported in [1], but was carried out with two different 
implementations and different dynamical models. Experience with these initial 
experiments stimulated us to integrate modules corresponding to each approach 
within the same system. In this manner, the same shape and motion models can 
be used for both tracking approaches. 

The next section outlines the test system (see also [2,10]) and the two tracking 
approaches. We then discuss the effects of different vehicle shape and motion 
models in Section 3, using the rheinhafen image sequence [12]. The insights 
gained thereby are exploited in Section 4 to compare the two approaches on a 
more complex image sequence (the ‘dt_passat03’ sequence, see [12]) with more 
vehicles being seen from a larger distance. Experiments have been carried out on 
the well-known PETS-2000 [13] image sequence as well. Results are comparable 
to those obtained for the rheinhafen sequence and will be presented elsewhere. 



2 The Approaches to Be Compared 

The tracking experiments were performed within MOTRIS (Model-based 
Tracking in Image Sequences), a framework implemented in Java (see, e.g.,
[10,2]), partially based on modules developed by other members of our labora- 
tory and released under the GNU GPL [14]. In the experiments reported in this 
contribution, all vehicles were initialised interactively in order to avoid errors by 
automatic initialisation which might complicate the interpretation of the results. 



2.1 Edge Element Association (EEA) 



In the EEA approach, the boundaries between object and background in the
image plane, as well as the internal boundaries of the object, are assumed to
generate edge elements in the image. An edge element $e = (u_e, v_e, \phi_e)^T$
represents a local maximum of the gradient norm in gradient direction $\phi_e$ at the
position $(u_e, v_e)$. As illustrated in Figure 1 (left), the distance measure $d_m$
between an edge element $e$ and a projected model segment $m$ depends on (i) the
Euclidean distance $b$ between the edge element and the model segment and (ii)
the difference $\Delta$ between the gradient direction of the edge element and the
normal to the model segment:



$$d_m(e, m) = \frac{b}{\cos \Delta} \qquad (1)$$



It is assumed that $d_m$ is normally distributed with zero mean and variance
$\sigma^2$, implying a Mahalanobis distance which follows a $\chi^2(1)$ distribution. Edge
elements which exceed the $(1 - \alpha)$ quantile of the Mahalanobis distance are
rejected. Edge elements which can be assigned to several model segments are 
assigned to the segment associated with the smallest Mahalanobis distance. The 
object pose is refined by iteratively minimising the sum of squared distances 
between edge elements and projected model segments. Thus, the optimisation 
of the model pose is reduced to an iterated least-squares problem. 
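
The gating and assignment rule described above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the MOTRIS implementation: the function names, the input layout (a list of candidate segments with their $b$ and $\Delta$ values) and the default $\alpha = 0.05$ are hypothetical. The $\chi^2(1)$ quantile is obtained from the standard normal quantile, since the square of a standard normal variable follows a $\chi^2(1)$ distribution.

```python
import math
from statistics import NormalDist


def edge_distance(b, delta):
    """Distance measure of Eq. (1): d_m = b / cos(delta)."""
    return b / math.cos(delta)


def chi2_1_quantile(p):
    """p-quantile of the chi-square distribution with one degree of freedom.

    If Z is standard normal, Z**2 follows chi^2(1), so the quantile is the
    square of the (1 + p) / 2 quantile of the standard normal distribution.
    """
    return NormalDist().inv_cdf((1.0 + p) / 2.0) ** 2


def associate(candidates, sigma, alpha=0.05):
    """Choose the model segment for one edge element.

    candidates: list of (segment_id, b, delta) tuples, one per model segment
    the edge element could belong to (hypothetical input layout).
    Returns the id of the segment with the smallest squared Mahalanobis
    distance, or None if every candidate exceeds the (1 - alpha) quantile.
    """
    gate = chi2_1_quantile(1.0 - alpha)
    best_id, best_d2 = None, gate
    for segment_id, b, delta in candidates:
        d2 = (edge_distance(b, delta) / sigma) ** 2  # squared Mahalanobis distance
        if d2 <= best_d2:
            best_id, best_d2 = segment_id, d2
    return best_id
```

The accepted pairs would then enter the iterated least-squares pose refinement, which is not shown here.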




Fig. 1. Left: Edge elements and their distances from the closest model segment in the
direction of the edge gradient. Right: Measurement points on the normals to the same 
model segment and estimated displacements for each normal, in the MCo approach. 



2.2 Marginalised Contours (MCo) 

The MCo approach is based on a statistical model of image generation and 
Maximum a Posteriori estimation of the vehicle pose. The principal assumptions 
underlying the MCo statistical model are 

1. Grey value differences $\Delta I$ between adjacent pixels from the same region have
a prior probability density $f_L(\Delta I)$ which is sharply peaked around zero, while
grey value differences between pixels across an edge (object boundary) have
a broad probability density $f_E(\Delta I)$ which may be considered to be uniform.

2. The visible object boundaries occur at distances, from the projection of the
model into the image plane, which are randomly distributed with a Gaussian
probability density $f_D$ with zero mean and variance $\sigma^2$.

Assumption (1) implies that the likelihood ratio $f_E(\Delta I) / f_L(\Delta I)$ can be used
as an estimator of whether a projected model edge falls between the pixels whose
grey values are sampled. When assumption (2) is taken into account, it becomes
necessary to integrate (marginalise) the likelihood ratios over all possible
distances between projected model segments and object boundaries in the image
plane. By approximating the integration with a discrete summation, we obtain
an equation for the marginalised likelihood ratio $r_k$ at a sample point $k$:



$$r_k = \sum_j \frac{f_E(\Delta I_{k,j})}{f_L(\Delta I_{k,j})} \, f_D(j \Delta\nu) \qquad (2)$$

where the sum is over a number of measurement points indexed by $j$, equally
spaced by a distance $\Delta\nu$ on the normal line to the sample point, see Figure 1
(right). By summing logarithms of marginalised likelihood ratios over all sample
points $k$, we arrive at the evaluation function $E = \sum_k \log r_k$ which is to be
maximised w.r.t. the vehicle’s pose parameters. The Hessian of this evaluation
function is not guaranteed to be negative definite and therefore the simple Newton
method is not applicable for maximisation. However, a number of alternative
gradient-based methods can be applied, for example the EM method used in [8].
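
To make the marginalised likelihood ratio and the evaluation function concrete, the following Python sketch evaluates both for grey values sampled along the normals. The concrete densities are assumptions made only for illustration: a Laplacian for $f_L$ (the text only requires it to be sharply peaked around zero), a uniform density over 8-bit grey-value differences for $f_E$, and hypothetical parameter values `lam` and `sigma`. The measurement points are assumed to be centred on the projected model segment.

```python
import math


def f_L(delta_i, lam=2.0):
    """Assumed Laplacian density of grey-value differences within a region."""
    return math.exp(-abs(delta_i) / lam) / (2.0 * lam)


def f_E(delta_i, grey_levels=256):
    """Assumed uniform density of grey-value differences across an edge."""
    return 1.0 / (2 * grey_levels - 1)


def f_D(distance, sigma=2.0):
    """Gaussian density of the boundary displacement along the normal."""
    return math.exp(-0.5 * (distance / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def marginalised_ratio(grey_values, spacing=1.0):
    """Marginalised likelihood ratio r_k for one sample point k.

    grey_values: grey values sampled along the normal through the sample
    point; differences of adjacent samples are the measurements, and the
    middle difference is taken to lie on the projected model segment.
    """
    diffs = [b - a for a, b in zip(grey_values, grey_values[1:])]
    centre = (len(diffs) - 1) / 2.0
    return sum(
        f_E(d) / f_L(d) * f_D((j - centre) * spacing)
        for j, d in enumerate(diffs)
    )


def evaluation(normals, spacing=1.0):
    """Evaluation function E: sum of log r_k over all sample points."""
    return sum(math.log(marginalised_ratio(g, spacing)) for g in normals)
```

With these assumptions, a grey-value step located on the projected segment yields a larger ratio than the same step displaced along the normal, which is the behaviour the pose optimisation exploits.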



3 Experiments Using the rheinhafen Image Sequence 

In order to set the stage, we investigate how the two tracking approaches perform 
on the rheinhafen sequence with different shape and motion models. 

In the original report [8], the MCo approach relied on an ‘angular-velocity’
motion model: the state of the vehicle to be tracked is described by position 
on the ground plane, orientation, tangential velocity, angular velocity, and tan- 
gential acceleration. An alternative ‘steering-angle’ motion model [4] was used 
in some of the experiments to be described in the following. This alternative 
includes the same state variables except for angular velocity and tangential ac- 
celeration, which are replaced by a steering angle and thus additional nonlineari- 
ties. Using the steering angle implies that the vehicle orientation change depends 
on the vehicle speed - this model does not allow an orientation change at zero 
tangential velocity. Since such a behaviour reflects driving physics more accu- 
rately, the steering angle model is expected to perform better on vehicles with 
changing speed. 
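
The qualitative difference between the two motion models can be illustrated with a short Python sketch. It uses a generic kinematic bicycle-style update for the steering-angle model, which is an assumption and not necessarily the exact formulation of [4]; the wheelbase value, the function names, and the omission of acceleration and process noise are likewise simplifications made here.

```python
import math


def step_angular_velocity(x, y, theta, v, omega, dt):
    """Angular-velocity model: orientation changes at rate omega,
    independently of the tangential velocity v."""
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + omega * dt)


def step_steering_angle(x, y, theta, v, psi, dt, wheelbase=2.7):
    """Steering-angle model (assumed bicycle kinematics): orientation
    changes at rate v * tan(psi) / wheelbase, so it vanishes for v = 0."""
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + v * math.tan(psi) / wheelbase * dt)
```

For a vehicle at rest (`v = 0`), the steering-angle update leaves the orientation unchanged whatever the steering angle, whereas the angular-velocity update still rotates the vehicle.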



3.1 Influence of Wheel Arches in the Vehicle Model 

Initial tracking experiments indicated that the presence or absence of wheel 
arches constitutes an important feature of the shape model. This is confirmed 
by more systematic experiments. 

Figure 2 illustrates the impact of the shape model for tracking of a notchback
using different motion models and tracking approaches. The model has been 
manually optimised for this particular vehicle. Except in one case (using the 
EEA algorithm and the angular velocity motion model, discussed above and in 
Section 3.2), tracking results improve when using a model with wheel arches. 
The choice of the less accurate model without wheel arches leads to tracking 
failure for the combinations of EEA/steering angle and MCo/angular velocity. 

[Panel titles: MCo - steering angle, MCo - angular velocity]



Fig. 2. Comparison of tracking results using vehicle model with (green) and without 
(red) wheel arches for a notchback at half-frame 2320 of the rheinhafen sequence for 
all combinations of tracking approaches and motion models. 



3.2 Angular-Velocity Versus Steering-Angle Motion Model

Further experiments were carried out to compare the two motion models. Fig- 
ure 3 shows tracking results on the same vehicle as in Figure 2 obtained for all 
possible combinations of tracking approaches and shape models. 

As expected, introduction of the steering angle improves tracking performance.
Only for the combination of the EEA tracking approach with a no-wheel-arch
geometric model does the use of the steering-angle motion model lead to
worse tracking results than the angular-velocity model.

Note that in all the experiments reported above, the same parameters were 
used for shape and motion models except, of course, for the shape parameters of 
the wheel arches and for the Kalman process noise associated with state variables 
present only in one of the two motion models. 

Moreover, it can be seen that the choice of shape and motion model can 
outweigh the choice of the tracking approach (compare, e.g., top left panel (EEA 
/ steering angle / wheel arches) and bottom right panel (MCo / angular velocity 
/ no wheel arches) in Figure 2). 

In conclusion, the experiments discussed so far suggest that the combination 
of a shape model with wheel arches and a motion model based on the steering 
angle - which has not been attempted before - is most efficient for model-based 
tracking in image sequences like those studied here. 

[Panel titles: EEA - wheel arches, EEA - no wheel arches, MCo - wheel arches, MCo - no wheel arches]



Fig. 3. Comparison of tracking results using steering angle (green) and angular velocity 
(red) motion model for a notchback at half-frame 2380 (lower left panel) and half- 
frame 2320 (other panels) of the rheinhafen sequence for all combinations of tracking 
approaches and shape models. 



4 Experiments with a Traffic-Intersection Sequence 

Figure 4 (left) shows a representative frame from the sequence ‘dt_passat03’ (see 
[12]). Here, vehicle images are smaller than in the rheinhafen sequence so that 
the presence or absence of wheel arches is no longer very important: in fact, in 
many cases the wheel arches cannot be seen at all, due to the viewing distance
and especially due to the viewing angles. 

Figure 4 (right) summarises tracking results obtained by the ‘optimal’ com- 
bination of shape and motion models (wheel arches in the shape model and 
steering-angle motion model). The results obtained with either the EEA or the 
MCo approaches are not as good as those reported in [3]. However, this com- 
parison is compromised by the fact that the latter results were obtained by a 
tracker including optical flow in addition to the EEA-approach: the integration 
of optical flow information reportedly leads to more robust tracking. 

Vehicles 1, 6, 7, 13, and 15 are tracked successfully with the MCo approach 
while visible. Vehicles 1, 2, 6, 7, 9, 10, 12, 13, and 15 are tracked successfully 
with the EEA approach; vehicle 14 is tracked for much longer with the MCo 
approach, whereas vehicles 5, 9, and 10 can be tracked for much longer with
the EEA approach. Tracking failures occur with both approaches for six other 
vehicles. Only five vehicles (1, 6, 7, 13, and 15) are tracked successfully by both 
approaches. 

Fig. 4. Left: a typical frame of the dt_passat03 test sequence. Right: a graph illustrating
the visibility of vehicles 1 to 15 as a function of frame number and the intervals
during which the vehicles are successfully tracked by the EEA and MCo approaches, 
as determined by visual inspection. 



5 Conclusions 

Our experiments aim, first, at clarifying the effects of implementation alternatives
and separating these effects from the effects of more fundamental assumptions
underlying the two approaches under investigation; second, at gaining insight
into the differential strengths and weaknesses of the two approaches when applied
to challenging traffic sequences.

The experiments reported in this paper support the following conclusions: 

1. Seemingly innocuous differences in the motion models used may, under cer- 
tain circumstances (see Figures 2 and 3), be more important than the dif- 
ference between the EEA and the MCo approaches. 

2. Differences between the shape models can be more important than differ- 
ences between motion models, again independently of the approach (EEA 
or MCo). 

3. One must be careful, therefore, when comparing tracking approaches that were
evaluated with different shape or motion models.

4. Similar to the results reported in [1], it is inconclusive which tracking ap- 
proach performs best. The rheinhafen sequence seems to favor the MCo 
algorithm while the dt_passat03 sequence gives an edge to the EEA ap- 
proach. 

5. Failures of the two approaches at different frames in the dt_passat03 sequence
suggest that, even though the two approaches use the same information (grey
value gradients in the image plane), a combination of their advantages might
result in more robust tracking.

As a consequence, our experiments suggest analysing the influence of other
alternatives and parameter settings prior to the formulation of a ‘final verdict’
about a particular approach.




Acknowledgements. The authors gratefully acknowledge partial support of
these investigations by the European Union FP5-project CogViSys (IST-2000-29404).

References 

1. H. Dahlkamp, A. Ottlik, and H.-H. Nagel: Comparison of Edge-driven Algorithms
for Model-based Motion Estimation. In Proc. Workshop on Spatial Coherence for
Visual Motion Analysis (SCVMA’04), 15 May 2004, Prague, Czech Republic.

2. H. Dahlkamp: Untersuchung eines Erwartungswert-Maximierung (EM)-Kontur-Algorithmus
zur Fahrzeugverfolgung. Diplomarbeit, Institut für Algorithmen und
Kognitive Systeme, Fakultät für Informatik der Universität Karlsruhe (TH),
Januar 2004.

3. M. Haag and H.-H. Nagel: Combination of Edge Element and Optical Flow Esti- 
mates for 3D-Model-Based Vehicle Tracking in Traffic Image Sequences. Interna- 
tional Journal of Computer Vision 35:3 (1999) 295-319. 

4. H. Leuck and H.-H. Nagel: Automatic Differentiation Facilitates OF-Integration
into Steering-Angle-Based Road Vehicle Tracking. In: Proc. IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR’99), 23-25
June 1999, Fort Collins, Colorado/USA; IEEE Computer Society Press, Los Alamitos/CA,
Vol. 2, pp. 360-365.

5. D.R. Magee: Tracking Multiple Vehicles Using Foreground, Background and Motion 
Models. Image and Vision Computing 22:2 (2004) 143-155. 

6. H.-H. Nagel, M. Middendorf, H. Leuck und M. Haag: Quantitativer Vergleich zweier
Kinematik-Modelle für die Verfolgung von Straßenfahrzeugen in Video-Sequenzen.
In S. Posch and H. Ritter (Hrsg.), Dynamische Perzeption, Proceedings in Artificial
Intelligence Vol. 8, Sankt Augustin: infix 1998, pp. 71-88 (in German).

7. A. E. C. Pece: Generative-model-based Tracking by Cluster Analysis of Image Dif- 
ferences. Robotics and Autonomous Systems 39:3-4 (2002) 181-194. 

8. A.E.C. Pece and A.D. Worrall: Tracking with the EM Contour Algorithm. Proceed- 
ings of the 7th European Conference on Computer Vision 2002 (ECCV2002), 28-30 
May 2002, Copenhagen, Denmark; A. Heyden, G. Sparr, M. Nielsen, P. Johansen 
(Eds.), LNCS 2350, Springer-Verlag, Berlin-Heidelberg-New York (2002), pp. 3-17.

9. A.E.C. Pece: The Kalman-EM Contour Tracker. Proceedings of the 3rd Workshop 
on Statistical and Computational Theories of Vision (SCTV 2003), 12 October 
2003, Nice, France;
http://www.stat.ucla.edu/~yuille/meetings/2003_workshop.php.

10. P. Reuter: Nutzung des Optischen Flusses bei der modellgestützten Verfolgung von
Fußgängern in Videobildfolgen. Diplomarbeit, Institut für Algorithmen und Kognitive
Systeme, Fakultät für Informatik der Universität Karlsruhe (TH), Oktober
2003.

11. C. Stauffer and W.E.L. Grimson: Learning Patterns of Activity Using Real-Time
Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 
22:8 (2000) 747-757. 

12. http://i21www.ira.uka.de/image_sequences/

13. ftp://pets2001.cs.rdg.ac.uk/PETS2000/test-images/

14. http://kogs.iaks.uni-karlsruhe.de/motris/




Efficient Computation of Optical Flow 
Using the Census Transform 



Fridtjof Stein 



DaimlerChrysler AG, 

Research and Technology, 

D-70546 Stuttgart, Germany 
Fridtjof.Stein@DaimlerChrysler.com



Abstract. This paper presents an approach for the estimation of visual motion 
over an image sequence in real-time. A new algorithm is proposed which solves the 
correspondence problem between two images in a very efficient way. The method 
uses the Census Transform as the representation of small image patches. These 
primitives are matched using a table based indexing scheme. We demonstrate the 
robustness of this technique on real-world image sequences of a road scenario 
captured from a vehicle based on-board camera. We focus on the computation of 
the optical flow. Our method runs in real-time on general purpose platforms and 
handles large displacements. 



1 Introduction 

Motion information recovered from visual input is a strong cue for understanding
structure and three-dimensional motion. Visual motion allows us to compute
properties of the observed three-dimensional world without the requirement of extensive
knowledge about it.

In this paper we propose a strategy to efficiently determine the correspondences 
between consecutive image frames. The goal is to retrieve a set of promising image-to- 
image correspondence hypotheses. These correspondences are the basis for the compu- 
tation of the optical flow. First we select a robust and descriptive primitive type. Then we 
match all primitives in one image with all the primitives in the consecutive image with 
a table based structural indexing scheme. Using structure as a means of correspondence 
search yields a matching method without search area limits. 

This requires a computational complexity of $O(n)$ with $n$ the number of pixels in
the image. It is obvious that the number of matches has a complexity of $O(n^2)$, which is
an intractably high number for a typical image with $n = \mathrm{const} \cdot 10^5$ pixels.

We describe a pruning technique based on discriminative power to reduce this
complexity to $O(C \cdot n)$, where $C$ is a small constant. Discriminative power in this context is
the descriptiveness of a primitive. The discriminative power of a primitive is inversely
proportional to its occurrence frequency in an image. Temporal constraints further reduce
the number of correspondence hypotheses to the resulting optical flow.

The structure of this paper is as follows: In the next section we give a brief overview of
some previous work. In section 3 we discuss the feature type which we use: the Census 
Transform. The algorithm is presented in section 4. Section 5 explains the involved 
parameters. Some results illustrate the performance in section 6. 


2 Related Work

In recent years a large number of different algorithms for the computation of optical flow
have been developed. A good overview was given by Barron et al. [1] and by Cedras et al. [2].
Barron et al. classify the optical flow techniques in four classes: differential methods,
region-based matching, energy-based techniques, and phase-based approaches. All of
these methods have to handle trade-offs between computational efficiency, the maximum
length of the flow, the accuracy of the measurements, and the density of the flow. In order
to tackle the real-time constraint (see e.g. [3,4,5,6]), several strategies were followed:

- Strong limitations on the maximum allowed translation between frames increase
efficiency even for correlation based approaches.

- Image pyramids or coarse sampling lower the computational burden. E.g. using 
only selective features, and tracking them over an image sequence requires little 
computational power. 

- Some methods were implemented on dedicated hardware to achieve real-time per- 
formance (see e.g. [3,6]). 

Our method puts an emphasis on computational efficiency, and it allows a large span 
of displacement vector lengths. The accuracy is limited to pixel precision, and the density 
is texture driven. In areas with a lot of structural information the density is higher than 
in areas with little texture. 



3 The Census Transformation 



A very robust patch representation, the Census Operator, was introduced by Zabih and 
Woodfill [7]. It belongs to the class of non-parametric image transform-based matching 
approaches [8]. 

The Census Transform R(P) is a non-linear transformation which maps a local 
neighborhood surrounding a pixel P to a binary string representing the set of neighboring 
pixels whose intensity is less than that of $P$. Each Census digit $\xi(P, P')$ is defined as

$$\xi(P, P') = \begin{cases} 0 & P > P' \\ 1 & P \le P' \end{cases}$$



This is best demonstrated with an example. The left table shows the intensity values, 
the center table illustrates the Census values, and the right number is the corresponding 
clockwise unrolled signature vector. 

We extended the Census Transform by introducing the parameter $\varepsilon$ in order to
represent “similar” pixels. This results in ternary signature vector digits.

We decided to use the Census Transform as the basic primitive due to its robustness
with respect to outliers, and its simplicity of computation. In addition, the Census primitive
is highly discriminative. A signature vector of length $c$ (its cardinality) represents $3^c$
different patches. This implicit discriminative power is the key to the proposed algorithm.
The shape of the primitive (rectangular, circular, or star-like) is not significant.
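
A ternary Census signature of cardinality 8 can be sketched in Python as follows. The digit encoding is an assumption inferred from the surrounding text (digit 1 for “similar” pixels, 0 and 2 for clearly darker and clearly brighter neighbors); the clockwise starting position, the function names and the base-3 key are likewise illustrative choices, and any injective encoding of the digits would serve as a table key.

```python
def census_digit(center, neighbor, eps):
    """Ternary Census digit (assumed encoding).

    0: neighbor is darker than the center by more than eps
    1: neighbor is similar to the center (difference at most eps)
    2: neighbor is brighter than the center by more than eps
    """
    if center - neighbor > eps:
        return 0
    if neighbor - center > eps:
        return 2
    return 1


# Clockwise 8-neighborhood offsets (row, column), starting at the top-left pixel.
OFFSETS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def census_signature(image, row, col, eps=16, distance=1):
    """Signature of the pixel at (row, col) as a tuple of ternary digits.

    image: list of rows of grey values; the caller must keep (row, col) at
    least `distance` pixels away from the image border.
    """
    center = image[row][col]
    return tuple(
        census_digit(center, image[row + dr * distance][col + dc * distance], eps)
        for dr, dc in OFFSETS_8
    )


def signature_key(signature):
    """Encode the ternary digits as a single integer (base 3)."""
    key = 0
    for digit in signature:
        key = key * 3 + digit
    return key
```

For cardinality 8 the keys range over $3^8 = 6561$ values, and a uniform patch maps to the all-ones signature.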

While other approaches [7] use the Census Transform for correlation based matching 
we use it as an index into a table-based indexing scheme as described in the next section. 

The correlation based approaches use the Hamming Distance for deciding whether 
two patches are similar. Our approach requires a Hamming Distance of zero. Therefore 
we lose a certain amount of “near matches” due to the binning effect. Beis and Lowe [9] 
discuss the issue of indexing in higher dimensions in their article. The exact analysis of 
this loss is part of our current research. 




Fig. 1. The core algorithm with the resulting set of correspondence hypotheses. 



4 The Algorithm 

4.1 Finding Correspondences 

All images are slightly low-pass filtered with a 3x3 mean filter before applying the 
algorithm below. All tables are implemented as hash-tables. The following steps are 
illustrated in Figure 1:

1. Scan image frame 1. Compute for every pixel $P_i^1 = (u_i^1, v_i^1)$ the signature $\xi(P_i^1)$.

2. The ternary $\xi(P_i^1)$ is interpreted as a decimal number and serves as the key to a
table 1 in which the corresponding coordinate $(u_i^1, v_i^1)$ is stored.

3. Scan image frame 2. Compute for every pixel $P_j^2 = (u_j^2, v_j^2)$ the signature $\xi(P_j^2)$.

4. Look for every $\xi(P_j^2)$ in table 1 whether there are one or more entries with the same
signature vector.

5. All the resulting $(u_i^1, v_i^1) \leftrightarrow (u_j^2, v_j^2)$ pairs represent correspondence hypotheses.

6. If a consecutive image (e.g. in optical flow) has to be analyzed, create a table 2 from
all the $\xi(P_j^2)$.
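
The six steps above can be sketched with an ordinary Python dictionary acting as the hash table. This is a minimal illustration, not the author's C++ implementation; the function names and the input layout (a mapping from pixel coordinates to an already computed, hashable signature) are assumptions.

```python
from collections import defaultdict


def build_table(signatures):
    """Steps 1 and 2: store every coordinate under its signature.

    signatures: dict mapping (u, v) pixel coordinates to a hashable signature.
    """
    table = defaultdict(list)
    for coord, sig in signatures.items():
        table[sig].append(coord)
    return table


def correspondence_hypotheses(signatures_1, signatures_2):
    """Steps 3 to 5: pair every pixel of frame 2 with all pixels of
    frame 1 that carry exactly the same signature."""
    table_1 = build_table(signatures_1)
    hypotheses = []
    for coord_2, sig in signatures_2.items():
        for coord_1 in table_1.get(sig, ()):
            hypotheses.append((coord_1, coord_2))
    return hypotheses
```

For step 6, `build_table(signatures_2)` would be kept and reused as the lookup table when the next frame arrives, so each frame is scanned only once.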

It is obvious that this procedure leads to a huge number of correspondence hypotheses.
E.g. a patch with uniform intensity values and a center pixel $U$ results in $\xi(U) = 1^c$. In
our test images there are at least $10^4$ such uniform patches. Therefore
we have to analyze at least $10^8$ correspondence hypotheses.

The question is: How can we handle the explosion of the number of correspondence 
hypotheses? 

In our work we use filters. They are labeled in Figure 1 with F1 to F3.

F1 Some patch patterns do not contribute to any meaningful correspondence computation.
They are filtered out early. Typical patches are the above mentioned uniform
patch, or all the patches which are related to the aperture problem. An example is
depicted in Figure 2. F1 compares the retrieved signature vector with a list (called
the F1-list) of candidates, and inhibits the further processing if found in this table.
This list is learned on example sequences.

F2 F2 introduces the parameter max_discriminitive_power (= mdp). F2 inhibits the
correspondence hypothesis generation if a matching table entry has more than mdp
elements. Hypotheses are only generated if there are at most mdp elements. F2 filters out
patches which were not yet stored in the F1-list.

F3 The third filter uses illumination and geometric constraints to filter out unlikely
hypotheses. Without loss of generality we typically allow an intensity change of the
center pixel of 20% and we limit the displacement vector length to 70 pixels.
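
The interplay of the three filters can be sketched in Python as follows. The data layout and names are hypothetical, and the 20% intensity criterion is interpreted here as a change relative to the intensity in frame 1, which is an assumption; the text does not specify the reference value.

```python
import math


def filtered_hypotheses(table_1, pixels_2, f1_list, mdp,
                        max_length=70.0, max_intensity_change=0.2):
    """Generate correspondence hypotheses that pass filters F1 to F3.

    table_1:  dict mapping a signature to a list of (u, v, intensity)
              entries from frame 1.
    pixels_2: iterable of (signature, u, v, intensity) tuples from frame 2.
    f1_list:  set of signatures known to be uninformative.
    mdp:      maximum number of table entries allowed per signature.
    """
    result = []
    for sig, u2, v2, i2 in pixels_2:
        if sig in f1_list:              # F1: known uninformative pattern
            continue
        entries = table_1.get(sig, [])
        if len(entries) > mdp:          # F2: signature occurs too often
            continue
        for u1, v1, i1 in entries:      # F3: illumination and geometry
            if abs(i2 - i1) > max_intensity_change * max(i1, 1):
                continue
            if math.hypot(u2 - u1, v2 - v1) > max_length:
                continue
            result.append(((u1, v1), (u2, v2)))
    return result
```

Because F2 bounds the number of table entries inspected per pixel by `mdp`, the work per pixel of frame 2 stays bounded by a constant, which is where the linear complexity in the number of pixels comes from.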



4.2 Temporal Analysis 

The algorithm in the previous section produces a large set of correspondence hypothe- 
ses. They are created based on similarity, not based on any flow constraints. Using the 
correspondence hypotheses of the two previous frames (Figure 3), and attaching a cer- 
tain inertia to every displacement vector, we get a geometrical continuation constraint, 
illustrated in Figure 4. 

Such an inertia constraint filters out all flow hypotheses which have no predecessor 
among the correspondence matches. A valid predecessor has 

- a similar direction angle, 

- it is located close to the actual flow hypothesis (with respect to a moderate catch 
radius r as depicted in Figure 4), 

- and its vector length has not changed too much. 

Veenman et al. [10] address the temporal benefits extensively in their paper. It is important
to note that our approach allows multiple associations as shown in Figure 5. We apply 
no heuristics to get a one-to-one correspondence. It is for the user to decide which 
interpretation serves him best. 
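
The predecessor test can be sketched in Python as follows. The default thresholds are the typical values quoted in the parameter section (10 deg, 30%, a catch radius of about 3 pixels). The geometric reading of “located close” as the distance between the end point of the predecessor and the start point of the current vector is an assumption, as are the function names. Zero-length vectors are rejected in this sketch because their direction is undefined.

```python
import math


def is_valid_predecessor(prev, cur, max_angle_deg=10.0,
                         max_length_change=0.3, catch_radius=3.0):
    """prev, cur: displacement vectors given as ((u_start, v_start), (u_end, v_end))."""
    (ps, pe), (cs, ce) = prev, cur
    pdx, pdy = pe[0] - ps[0], pe[1] - ps[1]
    cdx, cdy = ce[0] - cs[0], ce[1] - cs[1]
    plen, clen = math.hypot(pdx, pdy), math.hypot(cdx, cdy)
    if plen == 0.0 or clen == 0.0:
        return False
    # Located close by: the predecessor ends within the catch radius
    # of the point where the current vector starts.
    if math.hypot(cs[0] - pe[0], cs[1] - pe[1]) > catch_radius:
        return False
    # Similar direction angle.
    cos_angle = (pdx * cdx + pdy * cdy) / (plen * clen)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    if angle > max_angle_deg:
        return False
    # The vector length has not changed too much.
    return abs(clen - plen) <= max_length_change * plen


def apply_inertia(previous, current, **limits):
    """Keep the current hypotheses that have at least one valid predecessor.

    No one-to-one assignment is enforced, so multiple associations survive.
    """
    return [c for c in current
            if any(is_valid_predecessor(p, c, **limits) for p in previous)]
```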

Fig. 2. Representation of an edge with the signature vector 0002 2220 0002 2220 and
a frequency of occurrence of 1875 in Figure 6.

Fig. 3. Using temporal constraints.

Fig. 4. Two consecutive correspondence vectors: $v_1$ originates from a correspondence
between frame 1 and frame 2, $v_2$ originates from a match between frame 2
and frame 3.

Fig. 5. Multiple interpretations have to be resolved later.



5 Parameters 

In the last sections several parameters were introduced. Their effect on the overall per- 
formance is as follows: 

Census $\varepsilon$: This parameter was introduced in Section 3. $\varepsilon$ represents the similarity
between Census features. If $\varepsilon = 0$, the signature vector becomes more sensitive to
noise. If $\varepsilon = \mathrm{grey}_{\max}$, then $\xi(P_i) = 1^c$ for all pixels $P_i$. All patches have the same
representation, and no discriminative power.

A typical value for 8 bit and 12 bit images is $\varepsilon = 16$.

Census cardinality c: The trade-off is discriminative power versus locality. Choosing 
a larger cardinality results in a patch representation with a higher discriminative 
power, but less locality. However, very local representations result in a high number
of correspondence hypotheses.

For our road scenes we got the best results with large cardinalities (e.g. 20). 

Census sampling distance: The example in Section 3 shows the design of a Census
operator of cardinality 8 representing a neighborhood with a sampling distance of 1.
However, we also looked at a sampling distance of 2 (skipping the adjacent neighbor
pixel). The results improved dramatically. This supports an often ignored truth, that
adjacent pixel values in natural images are not completely independent of each other
[11].

max_discriminitive_power: This parameter represents the maximal allowed
discriminative power of the Census features. Setting it to a very high value results in a lot
of potential correspondence interpretations. The constant $C$ which was mentioned
in Section 1 becomes very large.

The parameter max_discriminitive_power is dependent on the cardinality of the
selected Census Operator. Typical values are
max_discriminitive_power$_{c=8}$ = 12, and
max_discriminitive_power$_{c=20}$ = 3 or 2.

temporal analysis geometry constraints: Driven by the application, we use a maximum
direction angle change of 10 deg, and we allow a vector length change of 30%. The
radius $r$ is below 3 pixels.

min_age: The parameter min_age corresponds to the number of predecessors by which
a flow vector is supported. We set this value typically to 1 or 2. See Figures 6 and 7
for an example.




Fig. 6. A typical road scene. The displacement 
vectors are color coded with respect to their 
length. Warmer colors denote longer vectors. 
The displacement vectors in the lower right 
corner along the reflection post have a length of 
40 pixels. The parameter min_age is set to 1.




Fig. 7. The parameter min_age is set to 2. This 
results in fewer outliers, but also in a slightly 
reduced density. 



6 Experimental Results 

The following results on real images validate the approach. We applied the algorithm to a
large set of image sequences recorded from a camera platform in a moving car. Figure 6
and Figure 7 show a frame from an image sequence of 12 bit images. In Figure 6 the
parameter min_age is set to 1, in Figure 7 it is set to 2.

Notice that even subtle distinctions on the road surface allow for the flow computation.
The flow in the foreground is the flow on the hood of the car, functioning as a
mirror of the scene.

Using a feature primitive type for matching always raises the question about its
invariance properties. For the Census Transform the invariance with respect to
translational motion is obvious. However, the signature vector is not implicitly invariant to
rotational or scale transformations. We performed empirical tests on image sequences
and observed the number of correspondences. We observe a graceful degradation with
respect to rotation and scale. At an angle of 8 deg, or alternatively at a scale of 0.8, half
of the correspondences still remain.

One of the contributions of our algorithm is its speed. The following times were 
measured on a 1.2 GHz Pentium III computer with our non-optimized program, imple- 
mented in C++. The scene in Figure 8 was processed. The image (frame) dimensions 
are 768 x 284. Every pixel was processed. The pixel depth is 8 bit. 



Altogether 6577 correlation hypotheses were found. The 2513 flow hypotheses are 
shown in Figure 8. 

Fig. 8. An overtaking van. The original sequence consists of half frames. For aesthetic
reasons the image is displayed vertically scaled. Notice that due to the filters F1 and F2
no displacements are computed along the shadow edge.

Fig. 9. Closeup from Figure 7: the aperture problem.

7 Future Research

As can be seen in the closeup in Figure 9, there are quite a few stable outliers along
elongated edges. This is due to the imperfect nature of the filter F1. At the moment the
adaptation of the hard-coded F1-list is performed by collecting signature vectors with a
high occurrence frequency. It is planned to automate this process.

At the moment we are investigating other “richer” representations of an image patch 
than the Census Transform. Here, richness is a synonym for discriminating power. We 
expect higher flow densities. 

8 Summary and Conclusions 

Though we present this paper in the context of optical flow, the approach serves as a 
general tool for pixel-precise image matching. It is applicable to other vision problems 
such as finding correspondences in stereo images. Due to the arithmetic-free nature of 
the algorithm, it has the potential of a straightforward implementation on an FPGA. For 
the Census operator itself this was already demonstrated in Woodfill et al. [12].



Abstract. The estimation of motion in videos yields information useful
in the scope of video annotation, retrieval and compression. Current
approaches use iterative minimization techniques based on intensity gra- 
dients in order to estimate the parameters of a 2D transform between 
successive frames. These approaches rely on good initial guesses of the 
motion parameters. For single or dominant motion there exist hybrid 
algorithms that estimate such initial parameters prior to the iterative 
minimization. We propose a technique for the generation of a set of mo- 
tion hypotheses using blockmatching that also works in the presence of 
multiple non-dominant motions. These hypotheses are then refined using 
iterative techniques. 



1 Introduction / Related Work 

Motion is (besides audio) one of the major characteristics of video. Informa- 
tion about motion in a video can be useful in several applications, including 
video annotation, retrieval and compression. Mosaics built using the estimated 
motion can replace keyframes in video annotation and summarization [1]. The
motion parameters themselves can be used as features in video retrieval or fur- 
ther content analysis [2]. Moreover, an efficient description of motion provides a 
straightforward method for compressing videos [3,4]. 

In this paper we will focus on the estimation of motion between two succes- 
sive frames in a video. That is, given two frames, a transformation has to be 
found that maps points in one frame to corresponding points in the other frame. 
Current approaches use two-dimensional parametric motion models in order to 
constrain the possible number of transformations between successive frames [3, 
4, 5, 6, 7, 8]. Iterative minimization techniques are applied in order to determine 
the motion parameters that yield the lowest intensity difference (these are also 
called direct techniques). 

Thus, it is assumed that the intensity of a moving point remains constant 
over time. This assumption and the use of parametric models can cause problems 
in the following cases: 

— The intensity of a moving point changes due to lighting changes, etc. 

— Points that are visible in one frame get occluded or disappear in the other.

— The chosen model cannot sufficiently describe the given motion, e.g., if there
are multiple independently moving objects.

In [4] and [5] these problems are addressed by using robust minimization tech- 
niques. Points that violate the brightness constancy assumption, occluded and 
differently moving points are treated as outliers and their influence on the es- 
timation is reduced during the minimization process. Said robust methods can 
tolerate up to 50% of such outliers. A dominant motion component is needed for 
these methods to succeed. 

If there are multiple moving objects in a scene (i.e., no dominant motion
component can be found), other techniques have to be applied. Ayer et al. [3]
propose a layered approach that explicitly models multiple motions. Using an
EM (expectation-maximization) algorithm, their method alternately refines the
layer support and the motion parameters for each layer. The number of motion
layers is limited using a minimum description length criterion.

One general problem with the use of iterative minimization techniques is the 
need for good initial estimates of the motion parameters. Without such initial 
estimates, the algorithm may converge to local minima of the intensity error 
function, yielding incorrect results. 

In their robust estimator, Smolic et al. [4] address this problem by using 
blockmatching to compute an initial translational motion estimate. This estimate 
is then refined to affine and parabolic motion models using direct techniques. 
They show that this hybrid technique produces good results in the presence 
of large translational motion. However, a dominant motion component is still 
needed for the algorithm to work. Szeliski [6] proposes a similar approach but 
uses phase correlation to get an initial translation estimate. The case of non- 
dominant multiple motions is not addressed. 

Ayer et al. [3] initialize their layered algorithm with 16 layers. Each frame 
is divided into a 4 x 4 grid and the initial motion parameters for the 16 layers 
are computed by using direct robust estimation on each of the grid’s subparts, 
starting with zero motion. Thus their approach still lacks the computation of 
real initial estimates without the use of iterative minimization techniques. 

In the following section we propose a new technique for the generation of a set 
of initial translational motion estimates that is not based on iterative techniques 
and that does not need a dominant motion component. These initial parameters 
are then used as input to a direct layered approach based on [3]. Section 3 shows
the results of our approach. We conclude our contribution with an outlook on
our ongoing work in Sec. 4.

2 Proposed Algorithm 

The proposed hybrid motion estimator is divided into two major stages. The 
first stage generates a set of initial translational motion estimates, called motion 
hypotheses according to [9]. These are then refined using a derivative of the 
layered iterative estimator described in [3]. The following sections describe the 
two stages in more detail. In Sec. 2.3 we will describe the computation of the
description length that is used in both stages of the algorithm.

2.1 Generation of Motion Hypotheses

This stage is inspired by the work of Wiskott [9]. Wiskott took a dense motion
field computed with a phase difference technique as input for the hypothesis 
generation and used the results for motion segmentation based on edge matching. 
He used a simple threshold to determine the number of motions. Instead, we will 
use a sparse motion vector field based on blockmatching and apply a minimum 
description length criterion to limit the number of motion hypotheses. 

Input to this stage are two successive frames of one video shot. The central 
idea of the algorithm is the motion histogram. It measures the probability that 
any point moved under a certain translation between a given pair of frames. 
That is, given a translation $t = (t_x, t_y)^T$, the motion histogram at $t$ gives the
probability that any point $(x, y)^T$ in one frame moved to the point
$(x + t_x, y + t_y)^T$ in the other.

In our approach, we build up an approximated motion histogram using the 
results of a blockmatching algorithm on selected points. The use of blockmatch- 
ing is motivated by the following advantages that are desirable in our case: 

— It can identify arbitrarily large displacements between two frames.

— It measures translation by comparing relatively small blocks, so its perfor- 
mance is mostly unaffected by the presence of multiple motions. 

— It does not rely on good initial guesses and there are several efficient algo- 
rithms available. 

The drawbacks of blockmatching are:

— In principle, it can measure only translations. 

— It is not as precise as direct iterative techniques. 

— It measures translation by comparing relatively small blocks, so it suffers
from the aperture and correspondence problems.

However, the first drawback does not matter in practice. This is due to the fact 
that in real videos, rotation and zoom between two frames are relatively small in 
general [10]. Furthermore, every transformation reduces to simple translations 
if the regarded blocks are small enough. The second drawback is acceptable 
as we only need rough initial motion estimates which are then refined using 
iterative techniques. The precision of blockmatching algorithms should generally 
be sufficient to get close to the desired global minimum of the error surface. For 
the case of dominant motion, this was shown in [4]. 

To cope with the third drawback, we perform blockmatching only at points
where the aperture and correspondence problems are reduced to an acceptable
level. For the selection of these points we use the Harris detector first introduced
in [11], which was shown to be suitable for motion estimation purposes in [12].
This detector yields salient points that can be located well with blockmatching
techniques.

As the similarity measure for our blockmatching we use the normalized
correlation coefficient [13], with a block size of 16 x 16 pixels. We choose the
correlation coefficient over the more commonly used mean squared difference or
mean absolute difference [14,15,16] for two reasons. First, the difference
measures yield higher values in the case of low similarity, whereas a high
correlation coefficient means greater similarity, which is what we need to use the
results in the histogram generation. Second, as pointed out in [13], the
correlation coefficient can be used as a confidence measure for the match.
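The similarity measure can be illustrated with a minimal sketch. It assumes grayscale frames stored as NumPy arrays and an exhaustive search window; the function names, the search radius, and the treatment of flat blocks are illustrative assumptions and are not taken from [13] or from the implementation used for the experiments.

```python
import numpy as np

def normalized_correlation(block_a, block_b):
    """Normalized correlation coefficient of two equally sized blocks, in [-1, 1]."""
    a = np.asarray(block_a, dtype=float).ravel()
    b = np.asarray(block_b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    if denom == 0.0:
        # Flat block: the coefficient is undefined; treated as "no evidence" (assumption).
        return 0.0
    return float((a * b).sum() / denom)

def match_block(frame0, frame1, y, x, size=16, radius=8):
    """Exhaustively search frame1 for the block of frame0 anchored at (y, x).

    Returns (best_score, (ty, tx)), where (ty, tx) is the translation with the
    highest correlation inside the hypothetical search window of +/- radius.
    """
    ref = frame0[y:y + size, x:x + size]
    best_score, best_shift = -2.0, (0, 0)
    for ty in range(-radius, radius + 1):
        for tx in range(-radius, radius + 1):
            yy, xx = y + ty, x + tx
            if yy < 0 or xx < 0 or yy + size > frame1.shape[0] or xx + size > frame1.shape[1]:
                continue  # candidate block would leave the frame
            score = normalized_correlation(ref, frame1[yy:yy + size, xx:xx + size])
            if score > best_score:
                best_score, best_shift = score, (ty, tx)
    return best_score, best_shift
```

The returned score doubles as the confidence value of the match; a real system would replace the exhaustive loop by one of the efficient search strategies mentioned above.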

As the next step, the results of blockmatching on the selected points are
written into the motion histogram, weighted by their similarity value. This
decreases the influence of bad matches on the resulting histogram. A blockmatching
vector $v = (v_x, v_y)^T$ increases the value at $v$ in the motion histogram by the
corresponding similarity value. The spatial information of the motion vectors is
discarded entirely during the creation of the histogram. As a final step in the
motion histogram creation, it is smoothed through convolution with a Gaussian.
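The histogram construction described above can be sketched as follows. This is only an illustration under stated assumptions: integer-valued blockmatching vectors bounded by a hypothetical `max_disp`, similarity values already mapped to [0, 1], and a standard normalized Gaussian kernel truncated at three standard deviations.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel, truncated at 3 sigma and normalized to sum 1."""
    radius = max(1, int(round(3 * sigma)))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth(a, sigma, axis):
    """Convolve along one axis with zero padding, keeping the array shape."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    return np.apply_along_axis(
        lambda row: np.convolve(np.pad(row, r), k, mode="valid"), axis, a)

def motion_histogram(vectors, similarities, max_disp, sigma=1.0):
    """Accumulate similarity-weighted votes per translation, then smooth."""
    size = 2 * max_disp + 1
    hist = np.zeros((size, size))
    for (vy, vx), s in zip(vectors, similarities):
        hist[vy + max_disp, vx + max_disp] += s  # spatial position is discarded
    return smooth(smooth(hist, sigma, 0), sigma, 1)

def histogram_peak(hist, max_disp):
    """Translation (ty, tx) at the histogram maximum."""
    iy, ix = np.unravel_index(int(hist.argmax()), hist.shape)
    return int(iy) - max_disp, int(ix) - max_disp
```

Because uncorrelated false matches spread their votes over many bins while consistent matches pile up in one, the peak of the smoothed histogram is a plausible first motion hypothesis.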

In practice, the blockmatching algorithm will always return some incorrect
motion vectors. These errors may occur due to lighting changes or if a block lies
on a motion border. However, we expect these false vectors to be uncorrelated.
In contrast, we expect the correct motion vectors to be highly correlated. This
is based on the assumption that a frame contains, to some extent, homogeneously
moving regions that produce similar blockmatching results. Although we do not
know which vectors are correct and which are not, the approximated motion
histogram gives us an estimate of which translations really occurred between two
frames.

To generate our first motion hypothesis $h_1$, i.e., the initial motion parameters
of the first layer, we simply take the translation that corresponds to the
maximum of the motion histogram. For the following hypotheses we generate new
histograms, based on the previous histogram and all previous motion hypotheses.
In each new histogram, the weights of the blockmatching vectors are recalculated,
depending on their contribution to the previous histogram maxima. This
prevents a previously found maximum from being selected again as a motion
hypothesis.

The weights of the motion vectors are computed as follows: Let $v_i$ be the
motion vector resulting from the blockmatching with block $i$ and let $s_i \in [0, 1]$
be the corresponding similarity measure (higher values denote greater similarity).
Then the weights $w_{ij}$ used for the creation of the $j$th histogram are computed
from $v_i$, the similarity value $s_i$, the previous motion hypotheses
$h_1, \ldots, h_{j-1}$ and the variance $\sigma^2$ of the Gaussian used for histogram
smoothing:

$$ w_{i1} = s_i, \qquad w_{ij} = s_i \cdot \left( 1 - \max_{1 \le k < j} e^{-\frac{(v_i - h_k)^2}{\sigma^2}} \right), \quad j > 1 \qquad (1) $$



Given the vectors and the weights, the histograms are approximated by (including
the Gaussian smoothing):

$$ H_j(t) = \sum_i w_{ij} \cdot e^{-\frac{(t - v_i)^2}{\sigma^2}} \qquad (2) $$
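The reweighting scheme can be sketched as follows. The sketch assumes that the weights have the form $w_{i1} = s_i$ and $w_{ij} = s_i (1 - \max_k e^{-(v_i - h_k)^2/\sigma^2})$ and that the histogram is a weighted sum of Gaussians centred on the blockmatching vectors. It also replaces the description-length stopping test by a fixed, hypothetical `count` of hypotheses, so it illustrates only the suppression of previously found maxima.

```python
import numpy as np

def hypothesis_weights(vectors, similarities, hypotheses, sigma):
    """Weight of each vector for the next histogram, given earlier hypotheses."""
    v = np.asarray(vectors, dtype=float)
    s = np.asarray(similarities, dtype=float)
    if len(hypotheses) == 0:
        return s.copy()  # first histogram: plain similarity weights
    h = np.asarray(hypotheses, dtype=float)
    # Squared distance of every vector to every previous hypothesis.
    d2 = ((v[:, None, :] - h[None, :, :]) ** 2).sum(axis=2)
    return s * (1.0 - np.exp(-d2 / sigma ** 2).max(axis=1))

def histogram_value(t, vectors, weights, sigma):
    """Smoothed histogram evaluated at translation t."""
    v = np.asarray(vectors, dtype=float)
    d2 = ((v - np.asarray(t, dtype=float)) ** 2).sum(axis=1)
    return float((weights * np.exp(-d2 / sigma ** 2)).sum())

def generate_hypotheses(vectors, similarities, count, max_disp, sigma=1.0):
    """Pick `count` successive histogram maxima on an integer translation grid."""
    candidates = [(ty, tx)
                  for ty in range(-max_disp, max_disp + 1)
                  for tx in range(-max_disp, max_disp + 1)]
    hypotheses = []
    for _ in range(count):
        w = hypothesis_weights(vectors, similarities, hypotheses, sigma)
        values = [histogram_value(t, vectors, w, sigma) for t in candidates]
        hypotheses.append(candidates[int(np.argmax(values))])
    return hypotheses
```

Vectors that coincide with an earlier hypothesis receive a weight of zero, so the next maximum is drawn from the remaining, not yet explained motion vectors.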



The hypothesis generation stops when the description length cannot be further
decreased by adding the latest histogram maximum to the motion hypotheses.
The description length is the number of bits needed to fully describe the
second frame based on the first frame. The information that needs to be coded
for that purpose consists of the estimated motion parameters, the layers of
support for all motion components, and the residual error.

In general, the residual error will decrease with every motion hypothesis we 
add. In contrast, the number of bits needed to code the motion parameters 
and the layers of support will increase. For a motion hypothesis correspond- 
ing to real (and significant) motion, we expect the total description length to 
decrease [3]. Thus the later histograms, whose maxima are produced by erro- 
neous motion vectors or correspond to insignificant motion, do not contribute 
any motion hypotheses. The underlying principle, called the minimum descrip- 
tion length principle, is also used in the iterative refinement of the hypotheses. 
For its computation see Sec. 2.3. 

Output of this stage is a set of translations that describe the motion between 
the two input frames. 

2.2 Iterative Refinement 

In general, the set of motion hypotheses will not yield the optimal description of
the motion between the two frames. For example, a simple rotation, which can be
fully described by a single four-parameter transform, is likely to produce multiple
motion hypotheses. That is because the local translations caused by the rotation
will vary between different parts of the frames.

We adapt the layered approach first described in [3] to refine our initial 
estimates and to then remove motion layers that became obsolete. Instead of 
the pyramidal approach described by Ayer et al., we use a complexity hierarchy 
like that used in [4]. Thus, instead of building Gaussian pyramids of the input
frames, we build up a complexity pyramid with the translational motion model 
on top, the four-parameter model and the affine model at intermediate levels, 
and the perspective model at the bottom. At each level of the pyramid, the 
model complexity is increased, the current motion parameters are refined and 
redundant motion layers are removed. 

The following steps are performed at each level of the pyramid: 

1. The complexity of the motion model is increased. For example, if the previous 
level used translations, the current level uses the four-parameter model. 

2. The motion parameters and layer support are refined to minimize the
intensity difference between the two input frames. This is done using an EM
algorithm that alternates between computing the layers of support with fixed
motion parameters and refining the motion parameters with fixed support
layers using robust iterative minimization.

3. We test whether the removal of a motion layer reduces the description length.
If this is the case, the motion layer whose removal leads to the maximal
decrease of the description length is removed. Refinement and removal are
repeated until the description length cannot be further decreased.
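The removal test in step 3 amounts to a greedy search, which can be sketched as follows. The callbacks `description_length` and `refine` are hypothetical placeholders for the description length computation and the EM refinement of step 2; how a layer is represented is left open.

```python
def prune_layers(layers, description_length, refine=None):
    """Greedily remove the layer whose removal shortens the description most.

    `description_length(layers)` returns the number of bits for a layer set and
    `refine(layers)` optionally re-estimates the remaining layers after each
    removal. Both are assumed interfaces, not part of the original method text.
    """
    layers = list(layers)
    best = description_length(layers)
    while len(layers) > 1:
        # Description length after removing each layer in turn.
        candidates = [(description_length(layers[:i] + layers[i + 1:]), i)
                      for i in range(len(layers))]
        length, index = min(candidates)
        if length >= best:
            break  # no removal decreases the description length
        del layers[index]
        if refine is not None:
            layers = list(refine(layers))
            length = description_length(layers)
        best = length
    return layers, best
```

As a toy usage example, layers can be modelled as sets of explained points with a fixed cost per layer and a penalty per unexplained point; redundant layers are then dropped while layers that explain unique points are kept.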

2.3 Description Length 

The description length L decomposes into three parts:

1. the number of bits $L_p$ needed to code the motion parameters for each motion
layer,

2. the number of bits $L_l$ needed to code the layer support, i.e., the information
concerning which points belong to which motion layer,

3. the number of bits $L_r$ needed to code the residual error, i.e., the difference
between the second frame and the first frame, transformed based on the
motion layer information.

The motion parameters are $n \times m$ floating point values, where $n$ is the motion
model's number of parameters, e.g., $n = 6$ for the affine model, and $m$ denotes
the number of motion layers. According to [3] the number of bits needed to code
these parameters with sufficient accuracy depends on the number of points $p$
whose motion they describe. Thus, $L_p$ is given as (ld denotes the binary
logarithm):

$$ L_p = \frac{n \cdot m \cdot \mathrm{ld}(p)}{2} \qquad (3) $$
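As a small worked example, the parameter cost can be computed directly, assuming the form $L_p = n \cdot m \cdot \mathrm{ld}(p) / 2$ and a single point count $p$ shared by all layers (the latter is a simplifying assumption of this sketch).

```python
import math

def parameter_description_length(n_params, n_layers, n_points):
    """Bits for n_params * n_layers motion parameters describing n_points points."""
    return n_params * n_layers * math.log2(n_points) / 2.0

# Two affine layers (n = 6) describing 1024 points: 6 * 2 * 10 / 2 = 60 bits.
affine_two_layers = parameter_description_length(6, 2, 1024)
```

The cost grows only logarithmically with the number of points but linearly with the number of layers, which is why an additional layer must reduce the residual noticeably before it pays off.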

To compute the description length of the layer support $L_l$, we create a
membership image $l$ such that $l(x) = j \Leftrightarrow$ ($x$ belongs to layer $j$). We then
measure the information content of this image according to [17]. To account for
spatial coherence in the motion layers, we separate all points into those that lie at
the border of their motion layer (border points) and those whose neighbours all
belong to the same layer (inner points). Let $p_l(j)$ be the probability of a border
point belonging to the layer $j$. Accordingly, let $p_u$ be the probability of a point
being an inner point. Then the information content of the membership image and
thus the description length $L_l$ is given as:

$$ L_l = p \cdot \left( -p_u \cdot \mathrm{ld}(p_u) + \sum_{j=1}^{m} -p_l(j) \cdot \mathrm{ld}(p_l(j)) \right) \qquad (4) $$
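The layer support cost can be sketched as follows. The sketch assumes a 4-neighbourhood for deciding whether a point is a border point and estimates all probabilities as relative frequencies over the whole membership image; neither choice is stated in the text, so both should be read as assumptions.

```python
import numpy as np

def layer_support_length(membership):
    """Bits for a membership image: p * (entropy term of inner points
    plus entropy terms of the border points of each layer)."""
    l = np.asarray(membership)
    border = np.zeros(l.shape, dtype=bool)
    # A point is a border point if any 4-neighbour carries a different label.
    border[:-1, :] |= l[:-1, :] != l[1:, :]
    border[1:, :] |= l[1:, :] != l[:-1, :]
    border[:, :-1] |= l[:, :-1] != l[:, 1:]
    border[:, 1:] |= l[:, 1:] != l[:, :-1]

    p = l.size
    bits = 0.0
    p_inner = np.count_nonzero(~border) / p
    if p_inner > 0:
        bits -= p_inner * np.log2(p_inner)
    for j in np.unique(l):
        p_border_j = np.count_nonzero(border & (l == j)) / p
        if p_border_j > 0:
            bits -= p_border_j * np.log2(p_border_j)
    return p * bits
```

A membership image with a single layer has no border points and therefore costs nothing, whereas every additional layer adds border points and thus bits.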

The computation of the residual error description length is similar. However,
spatial coherence is not taken into account. Let $p_r(i)$ be the probability of a
point having the residual $i$. Assuming that the residual errors all lie in
$[-255, 255]$, the description length $L_r$ is given as:

$$ L_r = \sum_{i=-255}^{255} -p_r(i) \cdot \mathrm{ld}(p_r(i)) \qquad (5) $$
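The residual term can be sketched as the entropy of the residual histogram, assuming 8-bit frames and the form $L_r = \sum_i -p_r(i) \cdot \mathrm{ld}(p_r(i))$. The argument names are hypothetical, and the motion-compensated prediction of the second frame is assumed to be given.

```python
import numpy as np

def residual_description_length(predicted, actual):
    """Entropy (in bits) of the residual between the motion-compensated
    prediction and the actual second frame, for 8-bit intensities."""
    r = np.asarray(actual, dtype=int) - np.asarray(predicted, dtype=int)
    # Residuals lie in [-255, 255]; shift them to [0, 510] for counting.
    counts = np.bincount((r + 255).ravel(), minlength=511)
    prob = counts[counts > 0] / r.size
    return float(-(prob * np.log2(prob)).sum())
```

A perfect prediction yields a residual entropy of zero; the more widely the residuals scatter, the more bits are needed to code them.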

The total description length L is given by the sum of the three: 

$$ L = L_p + L_l + L_r \qquad (6) $$

In the following section we will test our algorithm on real sequences and 
compare it to the non-hybrid direct layered approach. 

3 Experimental Results 

To show the benefits of the hybrid algorithm over the algorithm based only
on iterative minimization, we apply it to sequences containing relatively large
motion, the sequences Horse and Basketball (see Fig. 1). The first sequence
shows a horseman galloping over a course from right to left, followed by a fast
camera pan. Most frames do not contain a dominant motion component. The
second sequence shows a basketball player walking on the field followed by the
camera. Several frames do not contain a dominant motion component.

Fig. 1. From left to right: Frames 1 and 32 from the sequence Horse, Frames 7 and
50 from the sequence Basketball

The results of the two approaches are compared in Fig. 2. The higher values
for description length show the failure of the estimation without motion
hypothesis generation. Note that in the cases where the non-hybrid algorithm
converged correctly, the processing time was still approximately four times higher
than with the hybrid algorithm. This is because the computation of initial
parameter estimates permits the iterative minimization to start closer to the global
optimum, thus leading to faster convergence.



Fig. 2. Performance of the proposed algorithm on the sequences Horse (left) and
Basketball (right). The results of the layered approach with (solid line) and without
(dashed line) motion hypothesis generation are shown in bits needed for coding,
plotted over the frame number



4 Conclusions and Outlook 

We have presented a hybrid algorithm for the estimation of multiple non- 
dominant motions. We showed that it yields superior performance compared 
to the non-hybrid algorithm if the need for good initial estimates is critical, e.g., 
in the presence of large translational motion. In other cases, our approach just 
reduces computation time. We are currently investigating the extension with so- 
called mosaicing techniques in order to create mosaics on scenes with multiple 
moving objects. This will be followed by an evaluation of the applicability of the 
resulting mosaics to video indexing and content retrieval.

Abstract. Visual cortical processing is segregated into pathways each consisting
of several cortical areas. We identified key mechanisms of local competitive
interaction, feedforward integration and modulatory feedback as common principles
of integration and segregation of ambiguous information to implement a principle
of evidence accumulation and feedback hypothesis testing and correction. In
previous work we demonstrated that a model of recurrent V1-MT interaction
disambiguates motion estimates by filling-in. Here we show that identical
mechanisms along the ventral V1-V2-V4 pathway are utilized for the interpretation of (1)
stereoscopic disparity and (2) relative depth segregation of partially overlapping 
form. The results show that absolute and relative depth ambiguities are resolved 
by propagation of sparse depth cues. Lateral inhibition emerges at locations of 
unambiguous information and initiates the recurrent disambiguation process. Our 
simulations substantiate the proposed model with key mechanisms of integration 
and disambiguation in cortical form and motion processing. 



1 Introduction 

Feedback processing plays a crucial role in neural information processing. Many
physiological studies substantiate the presence of feedback connections and their
influence on neural activity in earlier areas [1,2]. Technically speaking, such a
recurrent signal can be interpreted as a predictor from higher areas representing
more reliable information due to a larger spatio-temporal context. In this work we
present a general framework of bidirectional information processing, which is
demonstrated to model parts of the dorsal and ventral pathway.

Model outline. We present a unified architecture to integrate visual information 
and resolve ambiguities for motion, stereo and monocular depth perception. The basic 
principles of our model have been developed in the context of shape processing and 
boundary completion [3,4]. The model consists of two bidirectionally connected areas,
each of which implements identical mechanisms, namely feedforward integration, lat- 
eral interaction and excitatory feedback modulation (see Fig. 1 and section “Methods”). 
Both model areas differ only in the size of receptive fields and the input layer. The input 
layer to the first model area consists of motion sensitive cells (for motion perception), of 
disparity sensitive cells (for binocular depth perception), or of cells sensitive to occlu- 
sions representing monocular cues of relative depth ordering or figure/ground separation. 
The activity distribution generated by the first model area projects to the second model 
area, which contains cells with larger receptive fields. Thus, neural activity in the latter
area contains more context information and, as a consequence, is less ambiguous. In
both model areas the summed output of all cells at each location (sensitive to different
feature configurations) is utilized for normalization, which leads to flat tuning curves
for ambiguous feature configurations, while sharp tuning curves occur for unambiguous
signals [5]. Excitatory feedback modulation, in turn, acts as a predictor and serves
to amplify neural activity in the previous model area that matches the expectations
of the higher model area [6]. The advantage of purely excitatory modulatory feedback is
that the recurrent signal cannot generate activity in the absence of a feedforward signal
and that feedforward information is not affected (inhibited) when no feedback signal
is present [3,7,8]. The entire network realizes a mechanism of feedback hypothesis
testing and correction in a biologically plausible way, since purely excitatory back
projections from higher areas to lower areas are applied [1] and only simple
center-surround feedforward interaction schemes are utilized. The results demonstrate that
our model is capable of processing different modalities, which substantiates our claim
that we identified key mechanisms of integration and disambiguation in cortical form
and motion processing.




Fig. 1. Sketch of the model architecture with two bidirectionally interconnected
model areas. The main difference between both model areas is the spatial size of RFs.



2 Results 

Motion. We have previously presented our model for the perception of visual
motion [7,9,10]. Here we only give a brief summary of computational results and the
predictions arising from our simulations with motion sequences. An important detail is 
that feedback connections have to be shifted in order to follow the predicted signal (e.g. 
the target location of feedback from a cell indicating rightward motion has to be shifted 
to the right). Fig. 2a illustrates how initial unambiguous motion estimations are gener- 
ated at line endings or junctions, while ambiguous motion is detected along elongated 
contrasts indicating normal flow (motion parallel to local contrast orientation; aperture 
problem). Consistent with physiological data [11], such ambiguities are resolved over 
time. Cells in the second model area (corresponding to cortical area MT) initially in- 
dicating normal flow gradually switch to finally signal the correct velocity. The spatial 
distance of ambiguous motion cues to unambiguous motion cues determines the time 
needed to propagate the reliable information required for disambiguation. Thus, the
model solves the aperture problem through temporal integration in a scale-invariant
manner. As a consequence of feedback, the motion signal in both model areas is
disambiguated. We thus predict that cells in area V1 (or at least a subset of cells in
area V1) have similar temporal activity patterns as cells in MT.

Fig. 2. Computational results processing motion sequences. (a) Example showing the solution
of the aperture problem. Results obtained by processing an artificial sequence of a moving bar
(220x85 pixel). Bold arrows illustrate the velocity indicated by cells in model V1 at different
time steps of iterative bidirectional model V1-MT processing. Left. Initial velocity estimation
is only accurate at line endings. Along elongated boundaries only normal flow, parallel to local
gradient orientation, can be detected (aperture problem). Center. After 4 steps of iterative feedback
processing, motion features at line endings have propagated inwardly to represent unambiguous
motion along the boundary corresponding with the true image motion. Right. After 8 iterations
motion is completely disambiguated and the correct motion is indicated for the entire object.
(b) Example showing that the model also successfully processes real-world sequences (259x190
pixel).

We further demonstrate that our model is also capable of processing real-world im- 
ages (Fig. 2b). One set of parameter settings was used for all simulations (see section 
“Methods”). 

Stereo. The perception of stereoscopic information is influenced by the correspondence
of visual features in the left and right retinal images and by half-occlusions, which arise
from features that are visible to only one eye [12]. Here we only consider the
correspondence problem, which can be solved by applying the motion model to spatially
displaced stereoscopic images instead of temporally delayed images (image sequences).
To extract depth information from binocular images, the displacement of image features
(disparity) from one eye to the other has to be detected. The only difference is that
the detected displacements have to be interpreted as disparity instead of visual motion.
Thus, the predictive feedback from the higher area does not have to follow the motion
signal, but instead has to preserve its spatial location. We do not consider vertical
disparities [13]; therefore the range of detected vertical shifts was restricted to ±1 pixel,
which could arise from badly aligned stereo images. Fig. 3 shows a pair of stereo images
and the detected horizontal disparities using the identical model as used for motion
detection in Fig. 2 (except for the minor changes mentioned above). The filling-in
property, which disambiguates the motion signal in previous experiments, also
disambiguates stereo cues and successfully integrates and segregates absolute depth
information. Similar to results for motion processing (Fig. 2), the model is also capable
of processing real-world images (not shown). A modification of the model utilizes the
sum of oriented bipole cells (8 orientations) [3] instead of isotropic kernels. Fig. 3f
illustrates that bipole cells lead to a sharper segregation of objects lying at different
depths.

Fig. 3. Computational results processing a random dot stereogram. (a) Input stimulus (100x100
pixel, left-right-left) showing a pyramidal object with three depth layers. (b) Cells with isotropic
Gaussian RFs or (c) oriented bipole cells (sum over all orientations) are used to realize feedforward
integration in model area 2 (d). (e-f) Results for different kinds of feedforward information
(superimposed RF icons are bigger than actual RFs). Depth indicated by cells in the first model area
at different time steps of iterative bidirectional information processing (normalized gray-coded
depth information, dark=far, light=near). Bipole cells (c) lead to a sharper segregation of objects
lying at different depths (f) than cells with isotropic RFs (b,e).

Relative depth ordering. There are various monocular depth cues, such as occlusion,
blurring, linear perspective, texture, shading, and size. Here we focus on relative depth
from occlusions indicated by T-junctions (see Fig. 4a). Physiological studies [14]
revealed that such overlay cues (combined with shape cues and binocular disparity) are
utilized by cells of the ventral pathway (V1-V2-V4) to indicate border ownership. The
input to our model is composed of features such as the absolute luminance gradient of
the input image and T-junctions (illustrated in Fig. 4b), which were labeled manually,
although this could be achieved automatically [15]. At each location there is a set of
cells sensitive to different relative depths. Initially all cells at each location with high
contrast are stimulated in the same way (maximal ambiguity) (see Fig. 4d, t=0).
Propagation of relative depth is achieved in the same manner as for motion and stereo by
isotropic Gaussian filters, except at locations with T-junctions. Here, the excitatory flow
of information is restricted to follow the top of the T along object boundaries (indicated
by connected white discs in Fig. 4c). Additionally, we included inhibitory connections
between cells indicating the same depth located at the stem and the top of the T [16]
(indicated by connected black discs in Fig. 4c, see section “Methods”). Computational
results (Fig. 4d) illustrate how the model works. Remarkably, unresolvable ambiguities
remain unresolved, indicating multiple depths (see Fig. 4d, t=150). Note that if the
number of represented depth layers is larger than the number of actual depth layers (in
the stimulus), some objects are correctly represented on different contiguous depths. If
the number of represented depth layers is insufficient to represent the scene, additional
depth discontinuities appear along object boundaries. Our model predicts that contrast
selective cells tuned to different depth layers (foreground/background) show a typical
temporal development: cells located near occlusions are disambiguated earlier than cells
far away from monocular depth cues. Fig. 5 shows the time course of neural activity
sampled from six cell populations sensitive to figure/ground information at three
different positions (1,2,3) for two different stimuli, illustrating the model’s temporal
activity pattern.

Fig. 4. Example illustrating the disambiguation of complicated depth ordering tasks. (a) Input
stimulus. (b) Normalized absolute luminance gradient and manually labeled occlusions (the
spatial extent of receptive fields used for integration is overlaid). (c) Schematic interaction outline
of subfields for cells at T-junctions. Information is guided along occluding boundaries (excitatory
connection, white discs), while inhibitory connections (black discs) realize depth discontinuities
(see section “Methods”). (d) Computational results at different time steps of feedback interaction.
Activities of individual cells are illustrated on n = 5 different depth layers (light=low activity,
dark=high activity). Note that unresolvable ambiguities remain unresolved, such as for the
rectangle (1) for t=150, which could be located at the same depth as rectangle (2) and rectangle (3).



3 Methods 



The following equations describe the model dynamics of all model areas for iterative
disambiguation of motion, stereo, and relative depth. We present the steady state versions
of the equations used for all simulations. Differences for individual simulations or model
areas are discussed below.

Fig. 5. Model simulations illustrating model predictions concerning figure/ground segregation of
border-selective cells using n = 2 relative depth layers. (a,b) Two examples with locally identical
contrast configurations. Activity of foreground and background cells is pooled in a small spatial
neighborhood (9 pixel) at different locations (1-3). Different configurations of occlusion lead to
different temporal activity patterns indicating different figure/ground segmentations. To generate
more realistic temporal activity patterns, neural fatigue is realized through the same mechanisms
as for gated dipoles (see section “Methods”). (c) Activity of individual cells on different depth
layers at different time steps for both examples.

$v^{(1)}$ realizes the feedback modulation of the input to the model ($net_{IN}$) with the
feedback signal from higher areas ($net_{FB}$). $net_{EXT}$ is an additional external input
(see below). $C$ is a constant to control the strength of feedback and is set to $C = 100$
for model area 1 and $C = 0$ (no feedback) for model area 2 for all simulations.

$$ v^{(1)} = net_{IN} \cdot (1 + C \cdot net_{FB}) + net_{EXT} \qquad (1) $$

$$ v^{(2)} = (v^{(1)})^2 * G_{\sigma_1}^{(x)} * G_{\sigma_2}^{(f)} \qquad (2) $$

$$ v^{(3)} = \left( v^{(2)} - \frac{B}{2n} \cdot \sum_f v^{(2)} \right) / \left( 0.01 + \sum_f v^{(2)} \right) \qquad (3) $$
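The modulatory character of the feedback can be illustrated with a minimal sketch of the first model stage, assuming the form $v^{(1)} = net_{IN} \cdot (1 + C \cdot net_{FB}) + net_{EXT}$. Array shapes and the function name are illustrative assumptions.

```python
import numpy as np

def feedback_modulation(net_in, net_fb, net_ext=0.0, C=100.0):
    """First model stage: feedback multiplicatively enhances the input.

    With net_in == 0 the feedback term has no effect (feedback alone cannot
    create activity); with net_fb == 0 the input passes through unchanged.
    """
    net_in = np.asarray(net_in, dtype=float)
    net_fb = np.asarray(net_fb, dtype=float)
    return net_in * (1.0 + C * net_fb) + net_ext

silent = feedback_modulation([0.0, 0.0], [1.0, 1.0])        # stays at zero
unchanged = feedback_modulation([0.2, 0.4], [0.0, 0.0])      # equals the input
enhanced = feedback_modulation([0.2, 0.4], [0.01, 0.0])      # first cell doubled
```

This is the property referred to above as purely excitatory modulation: feedback can only amplify activity that is already supported by the feedforward signal.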



$v^{(2)}$ is the result of feedforward integration. Except for stereo integration with oriented 
filters (see Fig. 3), feedforward integration is realized by isotropic Gaussian filters 
in the spatial and feature (motion or depth) domain: $\sigma_1 = 0$ (Dirac) for model 
area 1 and $\sigma_1 \approx 7$ for model area 2; $\sigma_2 = 0.75$ for both model areas. $x$ encodes 
the spatial location and $f$ different feature configurations (e.g. velocities or depth). 
For stereoscopic integration with bipole cells we used the sum over all orientations 
$v^{(2)} = \sum_\theta \Lambda^{(x)}_\theta \{ (v^{(1)})^2 \} \ast G^{(f)}_{\sigma_2}$, where $\Lambda^{(x)}_\theta$ describes the multiplicative spatial 
interaction of subfields of bipole cells (see Fig. 3d). 

$v^{(3)}$ describes the lateral competition, which is realized by shunting inhibition. $v^{(3)}$ 
is used both as feedback signal to earlier areas and as input to succeeding areas. 
$n$ is the number of cells tuned to different features at one location. $B = 1$ for motion 
and stereo disambiguation and $B = 0$ for relative depth integration ($B$ describes the 
subtractive part of inhibition). 
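
The three steady-state steps can be illustrated for a single spatial location. The following is a minimal sketch, not the simulation code used for the results: it assumes that both Gaussian kernels in (2) reduce to a Dirac impulse (no spatial or feature smoothing), and the function name `model_area_response` and the input values are hypothetical.

```python
def model_area_response(net_in, net_fb, net_ext, C=100.0, B=1.0):
    """Steady-state response of n feature-tuned cells at one location.

    Assumption: the Gaussian kernels of Eq. (2) are Dirac impulses, so the
    feedforward integration reduces to a pointwise squaring.
    """
    n = len(net_in)
    # Eq. (1): feedback modulates the input multiplicatively; external input is added.
    v1 = [i * (1.0 + C * fb) + e for i, fb, e in zip(net_in, net_fb, net_ext)]
    # Eq. (2) without smoothing: squaring non-linearity.
    v2 = [v * v for v in v1]
    total = sum(v2)
    # Eq. (3): subtractive and divisive (shunting) inhibition over the n features.
    return [(v - B / (2.0 * n) * total) / (0.01 + total) for v in v2]


# Two equally active cells: the local measurement is ambiguous.
ambiguous = model_area_response([0.5, 0.5], [0.0, 0.0], [0.0, 0.0])
# A weak feedback signal for the first feature breaks the tie.
biased = model_area_response([0.5, 0.5], [0.01, 0.0], [0.0, 0.0])
print(ambiguous, biased)
```

Because feedback enters only as a factor of the input in (1), a cell without input stays silent regardless of the feedback it receives; feedback can bias the competition but cannot create activity on its own.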

For motion and stereo disambiguation the input signal $\mathrm{net}_{IN}$ to the first area 
is realized through correlation detectors [17] and $\mathrm{net}_{EXT} = 0$. For relative depth 
disambiguation, the input signal $\mathrm{net}_{IN}$ is the normalized absolute luminance gradient 
(shunting inhibition), copied to $n$ different depth layers. $\mathrm{net}_{EXT}$ is 0 except 
near T-junctions (occlusions). At T-junctions $\mathrm{net}_{EXT}$ is used to initiate the local 
disambiguation of relative depth. The possible depths indicated by neural activity 
patterns at the top (U) of T-junctions are used to interact with the activities in 
the different depth layers near the stem (L) of the T. $\mathrm{net}^{L}_{EXT}$ inhibits cells indicating 
impossible depths at the stem of the T according to the following formula: 
$\mathrm{net}^{L}_{EXT}(x, n) = -f([\mathrm{depthlayer}(x, n) - \mathrm{depthlayer}(x, n-1)]^{+})$. $f$ is a function 
which links the locations of the top of the T to locations of the stem of the T. 
Analogously, cells near the top of the T are inhibited using the following formula: 
$\mathrm{net}^{U}_{EXT}(x, n) = -f^{-1}([\mathrm{depthlayer}(x, n) - \mathrm{depthlayer}(x, n+1)]^{+})$. The value of 
$\mathrm{net}_{EXT}$ is computed as the sum of $\mathrm{net}^{L}_{EXT}$ and $\mathrm{net}^{U}_{EXT}$. To generate the results in 
Fig. 5, we gated $v^{(3)}$ with an additional term, similar to those used for gated dipoles, 
in order to simulate neural fatigue [18]. Finally, a positive constant (0.4) is added to 
$\mathrm{net}_{EXT}$. 



4 Discussion and Conclusion 

In this contribution we propose a general architecture for information integration and 
disambiguation. We demonstrate that models based on the proposed architecture are 
capable of processing different modalities, such as visual motion, binocular form infor- 
mation, and monocular depth ordering cues. We already presented and discussed our 
model for the integration of visual motion. Here we focused on binocular feature pro- 
cessing and monocular depth ordering. Compared to classical models of stereoscopic 
cue integration, our model could be classified as a cooperative-competitive model [19]. 
The feedback signal acts as ordering constraint and lateral interaction as uniqueness 
constraint [20]. In contrast to classical approaches, we realized the spreading of information 
by feedback integration, while lateral interaction accounts for saliency estimations, 
which initiate the filling-in process. In contrast to motion and stereo stimuli, even the least 
ambiguous relative depth ordering cues are still very ambiguous: in the presence of at 
least three objects, a single occlusion does not yield enough information to judge whether an 
object of interest is in the foreground or not. In contrast to motion and stereo, nearly 
no globally valid information can be extracted directly from the initial input patterns. Thus, 
spreading of less ambiguous information is essential for figure/ground segmentation. 
Our model suggests that bidirectional processing in the ventral pathway is indispensable 
for relative depth perception. As a consequence, the model predicts that performance in 
figure/ground segmentation tasks is impaired if feedback is suppressed. Our simulations further 
generate some predictions concerning the temporal development of cells selective to 
figure/ground information or border ownership, which could be verified in cortical cells of the 
form pathway. Due to bidirectional processing, it should not matter whether cells are observed 
in V1, V2 or V4. Our computational results illustrate that the same architecture is capable of 
processing different kinds of modalities, and thus substantiate the proposed model with key 
mechanisms of integration and disambiguation in cortical information processing. 
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Abstract. In this paper we address one of the standard problems of 
image processing and computer vision: The detection of points of inter- 
est (POI). We propose two new approaches for improving the detection 
results. First, we define an energy tensor which can be considered as 
a phase invariant extension of the structure tensor. Second, we use the 
channel representation for robustly clustering the POI information from 
the first step resulting in sub-pixel accuracy for the localisation of POI. 
We compare our method to several related approaches on a theoretical 
level and show a brief experimental comparison to the Harris detector. 



1 Introduction 

The detection of points of interest (POI) is a central image processing step for 
many computer vision systems. Object recognition systems and 3D reconstruc- 
tion schemes are often based on the detection of distinct 2D features. Since the 
further processing often relies on the reliability of the detection and the accuracy 
of the localization, high requirements are put on the detection scheme. However, 
most systems simply use standard operators, e.g. the Harris detector [1], as a 
black-box without noticing some serious signal theoretic and statistical prob- 
lems. The detection of POI mostly takes place in two steps: 

1. The image is subject to a (mostly non-linear) operator which generates a con- 
tinuous response (called POI energy in the sequel). 

2. Relevant maxima of the POI energy are obtained by thresholds and non- 
maximum suppression. 
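
The second step can be sketched in a few lines. The following is a generic illustration of thresholding followed by non-maximum suppression in an 8-neighbourhood, not the implementation of any particular detector; the function name `detect_poi` and the toy energy map are hypothetical.

```python
def detect_poi(energy, threshold):
    """Return (row, col, value) of thresholded strict local maxima, strongest first."""
    rows, cols = len(energy), len(energy[0])
    points = []
    for r in range(rows):
        for c in range(cols):
            value = energy[r][c]
            if value < threshold:
                continue  # thresholding step
            # 8-neighbourhood, clipped at the image border
            neighbours = [
                energy[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            # non-maximum suppression: keep the point only if it beats all neighbours
            if all(value > nb for nb in neighbours):
                points.append((r, c, value))
    return sorted(points, key=lambda p: p[2], reverse=True)


energy = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0, 0.0],
    [0.0, 0.2, 0.1, 0.3, 0.1],
    [0.0, 0.0, 0.3, 0.6, 0.2],
    [0.0, 0.0, 0.1, 0.2, 0.05],
]
poi = detect_poi(energy, threshold=0.5)
print(poi)
```

Note that the resulting positions are restricted to the pixel grid, which is one of the limitations addressed by the clustering approach proposed in this paper.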

In this paper we present alternatives for both steps. The two proposed methods 
can be used in conjunction or separately: 

1. For the generation of the POI energy we propose a new 2D energy tensor, 
which can be considered as a combination of a quadrature filter and the struc- 
ture tensor. 

2. For the detection of relevant maxima, we propose a clustering method based on 
the channel representation, which allows POI to be detected with sub-pixel accuracy 
and includes non-maximum suppression as a natural property of the decoding. 

* This work has been supported by DFG Grant FE 583/1-2 and by EC Grant IST- 
2002-002013 MATRIS. 

The main advantages of the novel approaches are: 

- phase invariance and suppression of aliasing in the POI energy, 

- robustness of the detection (clustering), and 

- sub-pixel accuracy. 

The implementation of the proposed POI detector is straightforward and of low 
computational complexity. The (few) parameters have an intuitive meaning and 
are stable in a wide range. We describe both methods in detail in Sect. 2 and 3, 
respectively, and in Sect. 4 we summarise the results. 

2 The Energy Tensor 

In this section we introduce the 2D energy tensor. We start by briefly reviewing 
the 1D energy operator, which is a well-known technique in the field of speech 
processing. We then propose a new 2D approach, the energy tensor, and relate 
it to several other approaches in the field of image processing. 

2.1 The 1D Energy Operator 

This brief review of the 1D energy operator is based on [2], but the operator was 
first published in [3]. The purpose of the energy operator is to compute directly 
the local energy of a signal, i.e., the squared magnitude of a quadrature or Gabor 
filter response, without computing the Hilbert transform. The shortcoming 
of all Hilbert transform based methods is the theoretically infinite extent of the 
Hilbert kernel, whereas the energy operator is much more localised. Even if the 
quadrature filter is synthesized directly or optimized iteratively, there is a trade-off 
between large filter extent and phase distortions (which also imply amplitude 
distortions). This problem becomes even more severe in 2D. 

For continuous signals $s(t)$, the energy operator is defined as 

$$\Psi_c[s(t)] = [\dot{s}(t)]^2 - s(t)\,\ddot{s}(t) . \qquad (1)$$

Switching to the Fourier domain, this equals 

$$\int \Psi_c[s(t)] \exp(-i\omega t)\, dt = \frac{1}{2\pi} \left[ (i\omega S(\omega)) \ast (i\omega S(\omega)) - S(\omega) \ast (-\omega^2 S(\omega)) \right] , \qquad (2)$$

where $S(\omega)$ is the Fourier transform of $s(t)$. If the signal is of small bandwidth, it 
can be approximated by an impulse spectrum $S(\omega) = A\delta(\omega - \omega_0) + \bar{A}\delta(\omega + \omega_0)$. 
Inserting this spectrum in the left part of (2) yields 

$$\int [\dot{s}(t)]^2 \exp(-i\omega t)\, dt = -\frac{\omega_0^2}{2\pi} \left( A^2 \delta(\omega - 2\omega_0) - 2A\bar{A}\,\delta(\omega) + \bar{A}^2 \delta(\omega + 2\omega_0) \right) . \qquad (3)$$

The right part of (2), including its leading minus sign, gives the same terms with the 
opposite overall sign, except that the second term remains positive ($+2A\bar{A}\,\delta(\omega)$). 
The echo terms at $\pm 2\omega_0$ therefore cancel, such that 

$$\int \Psi_c[s(t)] \exp(-i\omega t)\, dt = \frac{2}{\pi}\, \omega_0^2 |A|^2 \delta(\omega) . \qquad (4)$$

As can be seen from this expression, the energy operator is phase invariant. 
It is important to notice that the square of the first derivative results in echo 
responses with frequency $2\omega_0$, which become alias components if $\omega_0$ is larger than 
one fourth of the sampling frequency. For the 2D case, i.e., the structure tensor, 
this fact was pointed out in [4], where the author suggests using an oversampling 
scheme. As can be seen from (4), the product of signal and second derivative 
compensates the aliasing components, which makes oversampling unnecessary. 
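
The phase invariance can be checked numerically with the standard discrete counterpart of the energy operator, $\psi[n] = s[n]^2 - s[n-1]\,s[n+1]$. The following is a minimal sketch under that assumption; the discrete form is not taken from the text above, and the function names and signal parameters are hypothetical. For a sampled sinusoid $A\cos(\omega n + \varphi)$ the discrete operator returns the constant $A^2 \sin^2\omega$ for every phase, whereas the squared first difference alone oscillates with twice the signal frequency.

```python
import math


def discrete_energy(s):
    """Discrete energy operator psi[n] = s[n]^2 - s[n-1]*s[n+1] on interior samples."""
    return [s[n] ** 2 - s[n - 1] * s[n + 1] for n in range(1, len(s) - 1)]


def squared_difference(s):
    """Square of a central first difference (the 'left part' only)."""
    return [((s[n + 1] - s[n - 1]) / 2.0) ** 2 for n in range(1, len(s) - 1)]


amplitude, omega = 2.0, 0.3
expected = amplitude ** 2 * math.sin(omega) ** 2

results = {}
for phase in (0.0, 1.0):
    signal = [amplitude * math.cos(omega * n + phase) for n in range(50)]
    energy = discrete_energy(signal)
    squared = squared_difference(signal)
    results[phase] = (energy, squared)
    print(phase, min(energy), max(energy), min(squared), max(squared))
```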



2.2 The 2D Energy Tensor 

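In the literature, a few attempts at defining a 2D energy operator can be 
found, see, e.g., [5]. Other related approaches are based on the idea of generalised 
quadrature filters using second order terms, i.e., products of filter responses, see 
[6,7]. All these approaches are considered in more detail below and compared to 
the 2D energy tensor. For continuous 2D bandpass signals, i.e. bandpass filtered 
images, $b(\mathbf{x})$, $\mathbf{x} = (x, y)^T$, the 2D energy tensor is defined as 

$$\Psi_c[b(\mathbf{x})] = [\nabla b(\mathbf{x})][\nabla b(\mathbf{x})]^T - b(\mathbf{x})[H b(\mathbf{x})] , \qquad (5)$$

where $\nabla = (\partial_x, \partial_y)^T$ indicates the gradient and $H = \nabla\nabla^T$ indicates the Hessian. 
Switching to the Fourier domain, this equals 

$$\int \Psi_c[b(\mathbf{x})] \exp(-i2\pi \mathbf{u}^T \mathbf{x})\, d\mathbf{x} = 4\pi^2 \left\{ -[\mathbf{u}B(\mathbf{u})] \ast [\mathbf{u}B(\mathbf{u})]^T + B(\mathbf{u}) \ast [\mathbf{u}\mathbf{u}^T B(\mathbf{u})] \right\} , \qquad (6)$$

where $B(\mathbf{u})$ ($\mathbf{u} = (u, v)^T$) is the 2D Fourier transform of $b(\mathbf{x})$. If the signal is 
of small bandwidth, it can be approximated by an impulse spectrum $B(\mathbf{u}) = 
A\delta(\mathbf{u} - \mathbf{u}_0) + \bar{A}\delta(\mathbf{u} + \mathbf{u}_0)$. Inserting this spectrum in the left part of (6), i.e., the 
structure / orientation tensor according to [8,9], yields 

$$-[\mathbf{u}B(\mathbf{u})] \ast [\mathbf{u}B(\mathbf{u})]^T = -\mathbf{u}_0\mathbf{u}_0^T \left( A^2 \delta(\mathbf{u} - 2\mathbf{u}_0) - 2A\bar{A}\,\delta(\mathbf{u}) + \bar{A}^2 \delta(\mathbf{u} + 2\mathbf{u}_0) \right) . \qquad (7)$$

The right part of (6) gives the same terms with the opposite overall sign, except that 
the second term remains positive ($+2A\bar{A}\,\delta(\mathbf{u})$). The echo terms at $\pm 2\mathbf{u}_0$ therefore 
cancel, such that 

$$\int \Psi_c[b(\mathbf{x})] \exp(-i2\pi \mathbf{u}^T \mathbf{x})\, d\mathbf{x} = 16\pi^2\, \mathbf{u}_0\mathbf{u}_0^T |A|^2 \delta(\mathbf{u}) . \qquad (8)$$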
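
The result can also be verified directly in the spatial domain. The following short derivation is a consistency check rather than part of the original argument; it assumes the single-frequency signal that corresponds to the impulse spectrum $A\delta(\mathbf{u} - \mathbf{u}_0) + \bar{A}\delta(\mathbf{u} + \mathbf{u}_0)$, with the phase written as $\varphi = \arg A$.

```latex
\begin{align*}
b(\mathbf{x}) &= 2|A|\cos(2\pi\,\mathbf{u}_0^T\mathbf{x} + \varphi), \\
\nabla b(\mathbf{x}) &= -4\pi|A|\sin(2\pi\,\mathbf{u}_0^T\mathbf{x} + \varphi)\,\mathbf{u}_0, \\
H b(\mathbf{x}) &= -8\pi^2|A|\cos(2\pi\,\mathbf{u}_0^T\mathbf{x} + \varphi)\,\mathbf{u}_0\mathbf{u}_0^T, \\
\Psi_c[b(\mathbf{x})] &= 16\pi^2|A|^2
  \left[\sin^2(2\pi\,\mathbf{u}_0^T\mathbf{x} + \varphi) + \cos^2(2\pi\,\mathbf{u}_0^T\mathbf{x} + \varphi)\right]
  \mathbf{u}_0\mathbf{u}_0^T
  = 16\pi^2|A|^2\,\mathbf{u}_0\mathbf{u}_0^T .
\end{align*}
```

The tensor is constant in $\mathbf{x}$ and independent of $\varphi$; its Fourier transform is the delta impulse with the weight stated in (8). The structure tensor term alone retains the $\sin^2$ factor and hence depends on the phase.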

The energy tensor is a second order symmetric tensor like the structure tensor. 
The latter is included in the energy operator, but it is combined with a product 
of even filters, which assures the phase invariance, as can be seen in (8). The 
energy tensor can hence be classified as a phase invariant, orientation equivariant 
second order tensor [10]. Like the 2D structure tensor, the energy operator 
can be converted into a complex double angle orientation descriptor [11]: 

$$o(\mathbf{x}) = \Psi_c[b(\mathbf{x})]_{11} - \Psi_c[b(\mathbf{x})]_{22} + i\,2\,\Psi_c[b(\mathbf{x})]_{12} , \qquad (9)$$

which is equivalent to the 2D energy operator defined in [12]. As one can easily 
show, $|o(\mathbf{x})| = \lambda_1(\mathbf{x}) - \lambda_2(\mathbf{x})$, where $\lambda_1(\mathbf{x}) > \lambda_2(\mathbf{x})$ are the eigenvalues of the 
energy tensor. Since the trace of the tensor is given by the sum of eigenvalues, 
we obtain $2\lambda_{1,2} = \mathrm{tr}(\Psi_c[b(\mathbf{x})]) \pm |o(\mathbf{x})|$, which can be subject to the same analysis 
as suggested in [13,14] or for the Harris detector [1]. However, a minor problem 
might occur in the case of not well defined local frequencies: the second term in 
(5), i.e., the tensor based on even filters, can become positive, corresponding to 
reduced or negative eigenvalues of the energy tensor. In this case, the estimate 
is not reliable and should be neglected by setting the response to zero. 
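
The relations between the tensor components, the double angle descriptor, and the eigenvalues can be sketched as follows. This is an illustrative example only: the function name `analyse_tensor` is hypothetical, and rejecting the response whenever the smaller eigenvalue is negative is one possible reading of the test described above.

```python
import math


def analyse_tensor(t11, t12, t22):
    """Return (o, lambda1, lambda2) for a symmetric 2x2 tensor.

    o is the complex double angle descriptor of Eq. (9); the eigenvalues follow
    from 2*lambda = trace +/- |o|. Assumption: the response is set to zero if
    the smaller eigenvalue is negative.
    """
    o = complex(t11 - t22, 2.0 * t12)
    trace = t11 + t22
    lam1 = 0.5 * (trace + abs(o))
    lam2 = 0.5 * (trace - abs(o))
    if lam2 < 0.0:
        return 0j, 0.0, 0.0
    return o, lam1, lam2


o, lam1, lam2 = analyse_tensor(3.0, 1.0, 1.0)
orientation = 0.5 * math.atan2(o.imag, o.real)  # halve the double angle
print(o, lam1, lam2, orientation)

rejected = analyse_tensor(1.0, 2.0, 1.0)  # eigenvalues 3 and -1
print(rejected)
```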

The operator (5) cannot be discretised directly since natural images are typically 
not bandpass signals. For this reason and in order to compute the derivatives 
for discrete data, the operator has to be regularized by a bandpass filter. For 
this purpose, we chose difference of Gaussians (DoG) filters since 

1. Derivatives of DoG filters are easy to compute by using the Hermite polyno- 
mials. 

2. Gaussian kernels are well approximated by truncation (rapid decay) or by 
binomial filters. They are much more localised than spherical harmonics. 

Since the DoG filter is a comparatively poor bandpass filter, multiple frequencies 
might occur. If these are not in phase, negative eigenvalues are possible. Further 
regions with potentially negative eigenvalues are intrinsically 2D neighborhoods. 
However, negative eigenvalues only occur sparsely and are easily compensated 
by setting the tensor to zero and by the in-filling of the subsequent processing. 

Comparable to the 1D case, one could compute a 2D quadrature filter response 
and square its magnitude to obtain the energy (tensor). With the monogenic 
signal [16], a suitable approach exists which is compatible with the proposed 
2D energy tensor concerning the phase model. In the monogenic signal, phase 
and orientation form a 3D rotation vector, and invariance w.r.t. the phase results 
in a projection onto a circle representing the orientation. As in the 1D case, using 
the energy tensor instead of a 2D quadrature filter is advantageous due to the 
better localization. The 2D counterpart of the Hilbert transform is the Riesz transform, 
which also suffers from a polynomial decay, resulting in 2D quadrature 
filters with either large support or phase distortions. 



2.3 Comparisons to Other Approaches 

The advantage of the energy tensor compared to the structure tensor has already 
been mentioned above. The energy tensor avoids aliasing and is phase invariant, 
see also Fig. 1. A closely related approach to phase invariant operators was presented 
in [6]. The difference there is, however, that the Hessian of the signal is 
multiplied with the Laplacian of the image, which does not result in a phase 
invariant descriptor in a strict sense. 

The approach of the boundary tensor in [7] proposes a similar combination 
of zero to second order terms in a tensor. However, the author proposes to use 
the square of the Hessian matrix for the even part of the tensor. Furthermore, he 
makes use of spherical harmonics of even and odd orders (see also [17]) instead 
of Gaussian derivatives, which leads to less compact filter kernels. 

Fig. 1. From left to right: detail from the image in Fig. 3, difference of the eigenvalues 
for the structure tensor, the second tensor in (5), and the energy tensor (divided by 
two). One can clearly see the echo responses for the first two tensors. The output from 
the energy tensor seems to be more blurred, but the applied regularisation (DoG filter 
with variances 1 and 2) of the derivatives was the same in all cases, which means in 
particular that all three responses show the same amount of rounding of corners. 

The energy operator suggested in [12] is equivalent to the complex orientation 
descriptor of the energy tensor. By considering solely the complex descriptor, the 
rejection of negative eigenvalues becomes impossible. Furthermore, the authors 
use spherical harmonics with constant radial amplitude response, since they are 
interested in fringe analysis which implies small bandwidth signals. The comment 
on spherical harmonics above also applies in this case. 

The energy operator suggested in [5] combines the 1D energy operators in 
x- and y-direction to compute the 2D response. Due to the missing cross-terms, 
this operator is not rotation invariant. It is compatible with extending quadrature 
filters to 2D by computing 1D filters in x- and y-direction and therefore 
corresponds to partial Hilbert transforms [18]. 



3 Channel Clustering of Orientation Information 

As pointed out in Sect. 1, the second step for the POI detection is the clustering 
of the POI energy. In order to obtain robustness, we use channel smoothing (see 
[19] for details) of the complex orientation descriptor, modified by the test for 
positive eigenvalues. In the channel representation, the feature axis (here: the 
double angle orientation $\theta(\mathbf{x}) = \arg(o(\mathbf{x}))$) is sampled with a compact, smooth 
basis function (a quadratic B-spline $B_2(\cdot)$ in our case): 

$$c_n(\mathbf{x}) = B_2(\theta(\mathbf{x}) - 2\pi n/N) , \qquad n = 1 \ldots N . \qquad (10)$$

The result is a pile of similarity maps, indicating the distance between the respective 
sampling points (channel centres) and the feature values. At every spatial 
position we therefore obtain an $N$-dimensional channel vector with non-negative entries 
which are large if the feature value is close to the corresponding channel centre 
and small (or zero) for large distances, see Fig. 2 for an example. 

Fig. 2. Channel smoothing in 1D. Left: a noisy signal (solid line) is encoded in channels. 
The channel values are represented by the size of the circles. Right: after averaging the 
channels (along the rows), the channel vectors (column vectors) approximate the pdfs 
and channel decoding means to extract the modes of these pdfs. Taking the first mode 
in every point results in the smoothed signal (solid line). 

Channel smoothing means averaging the channels along the spatial axes. Ergodicity 
of the channels (easier to assure than ergodicity of signals!) then implies 
that the channel vector at every spatial position approximates the probability 
density function of the feature at this position. The elementary task of the decoding 
process is to find the modes of this pdf in every point. In the case of B-spline channels, 
an approximative decoding scheme is obtained by normalised convolution 
[14] along the channel vector [19] (the spatial argument is omitted here): 

$$\theta = \frac{2\pi}{N} \left( m + \frac{c_{m+1} - c_{m-1}}{c_{m-1} + c_m + c_{m+1}} \right) , \qquad m = \arg\max_m\, (c_{m-1} + c_m + c_{m+1}) . \qquad (11)$$
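
Encoding, channel averaging, and decoding can be sketched for a single spatial position. The following is a minimal sketch, not the implementation used for the experiments. It assumes that the orientation is rescaled so that neighbouring channel centres are one unit apart, that the channels are periodic, and that the channels are numbered from 0 to N-1 instead of 1 to N; the function names are hypothetical.

```python
import math


def b2(d):
    """Quadratic B-spline with unit channel spacing (support |d| < 1.5)."""
    d = abs(d)
    if d <= 0.5:
        return 0.75 - d * d
    if d <= 1.5:
        return 0.5 * (1.5 - d) ** 2
    return 0.0


def encode(theta, n_channels):
    """Channel vector of an angle theta in [0, 2*pi), cf. Eq. (10)."""
    f = theta * n_channels / (2.0 * math.pi)  # rescale to channel units
    half = n_channels / 2.0
    # wrap the distance to the channel centre into [-N/2, N/2)
    return [b2((f - n + half) % n_channels - half) for n in range(n_channels)]


def decode(c):
    """Strongest mode of a non-zero channel vector, cf. Eq. (11)."""
    n_channels = len(c)

    def window(m):
        return c[(m - 1) % n_channels] + c[m] + c[(m + 1) % n_channels]

    m = max(range(n_channels), key=window)
    offset = (c[(m + 1) % n_channels] - c[(m - 1) % n_channels]) / window(m)
    return ((2.0 * math.pi / n_channels) * (m + offset)) % (2.0 * math.pi)


N = 8
samples = [0.95, 1.0, 1.05, 4.0]  # three consistent values and one outlier
vectors = [encode(s, N) for s in samples]
averaged = [sum(v[n] for v in vectors) / len(vectors) for n in range(N)]

robust = decode(averaged)
plain_mean = sum(samples) / len(samples)
print(robust, plain_mean)
```

The outlier ends up in channels that do not overlap with those of the consistent samples, so the decoded mode equals the mean of the three consistent values, whereas the plain mean is pulled towards the outlier.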



In our particular case, we are interested in obtaining robust orientation estimates. 
Since orientation represents spatial relations, we can make use of the feature 
value, or rather the channel centre, for choosing an anisotropic smoothing 
kernel, see [20]. These kernels can be learned from the autocorrelation functions 
of the channels by parametric or parameter-free methods. For the experiments 
below, we generated the kernels by optimising the parameters of a kernel which 
has a Gaussian function as radial amplitude response and a $\cos^2$ function as 
angular amplitude response (see also the 'hour-glass' filter in [4]). 

Instead of extracting the first mode from the smoothed orientation channels 
corresponding to the main orientation, we extract the residual confidence 
$R = \sum_{n=1}^{N} c_n - c_{m-1} - c_m - c_{m+1}$ as POI energy. Since we use weighted channels, 
i.e., after the encoding the channel vectors are weighted by the difference of the 
eigenvalues, the confidences obtained after decoding correspond to the product 
of local energy and residual orientation error. 
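
A minimal sketch of the residual confidence for one channel vector is given below. The periodic indexing of the channels and the function name `residual_confidence` are assumptions made for illustration, and the example channel vectors are made up.

```python
def residual_confidence(c):
    """R = sum of all channels minus the three channels around the main mode."""
    n_channels = len(c)

    def window(m):
        return c[(m - 1) % n_channels] + c[m] + c[(m + 1) % n_channels]

    m = max(range(n_channels), key=window)  # main orientation
    return sum(c) - window(m)


# One dominant orientation: all energy is explained by the main mode.
single = residual_confidence([0.125, 0.75, 0.125, 0.0, 0.0, 0.0, 0.0, 0.0])
# A second orientation is present: the unexplained energy is the POI energy.
double = residual_confidence([0.1, 0.5, 0.1, 0.0, 0.05, 0.2, 0.05, 0.0])
print(single, double)
```

A neighbourhood with a single orientation (an edge or line) thus yields a residual close to zero, while a neighbourhood containing several orientations (a corner or junction) yields a large residual.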

In order to get a POI list from the POI energy, we also make use of the channel 
representation. The orientation channels can be considered as 1D channel 
vectors over the spatial domain, but they can also be considered as a 3D channel 
representation of orientation, x-, and y-coordinate (similar to the channel representation 
of triplets in [21]). In this case, we consider a 3D (energy weighted) pdf, 
where each mode corresponds to a certain feature value at a certain spatial po- 
sition. By extracting the residual confidences for the orientation, we project the 
(weighted) 3D pdf onto a (weighted) 2D pdf for POI. Knowing the radial shape 
of the averaging filters (Gaussian kernel), we can consider this instead of the 
quadratic B-spline to be the basis function for our 2D channel representation of 
the spatial position. From the basis function we obtain directly the correspond- 
ing decoding scheme to extract the modes - we can apply the same method as 
described above but with larger support for the normalized convolution since 
the Gaussian function corresponds to a B-spline of infinite degree. 
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Fig. 3. Left: first 75 corner features detected with the Harris detector and subsequent 
non-maximum suppression. Right: first 75 corner features detected with the proposed 
combination of energy tensor and channel clustering. 

By extracting the modes, we get a sorted list of coordinates with correspond- 
ing confidences. This can be compared to performing a detection of local maxima 
and sorting these according to their absolute height. However, mode detection 
and maximum detection are not the same in general and channel decoding sorts 
the modes according to their robust error which is not necessarily related to 
the height of the maximum. Furthermore, the decoding of channels is a conceptually 
sound approach with a statistical interpretation, i.e., it can be investigated 
in a statistical sense and probabilities can be calculated for the whole process. 
Finally, the decoding is not ad-hoc, but pre-determined by the averaging kernel 
which is subject to an optimisation process, i.e., the whole method is generic 
and adaptive. 

To illustrate the results which can be achieved with the proposed method, we 
have run an experiment on one of the common test images for corner detection, 
see Fig. 3. Compared to the Harris detector, the higher localisation 
accuracy is especially striking. However, thorough experiments and comparisons according 
to [22] have to be done to verify the superior spatial localisation more generally. 

4 Conclusion 

We have proposed two new ingredients for POI detection algorithms: the energy 
tensor and the channel clustering. The energy tensor is a phase invariant and 
alias-free extension of the structure tensor which generalises the 1D energy operator. 
The channel clustering allows spatially local energy concentrations to be detected 
as modes of a 2D pdf, which is comparable to the detection 
of local maxima. Both methods have a consistent theoretical background, and in 
our experiment the implemented algorithm performed better than the Harris detector, 
especially concerning localisation. 
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Abstract. We introduce an approach for 3D segmentation and quantifi- 
cation of vessels. The approach is based on a new 3D cylindrical paramet- 
ric intensity model, which is directly fit to the image intensities through 
an incremental process based on a Kalman filter. The model has been 
successfully applied to segment vessels from 3D MRA images. Our ex- 
periments show that the model yields superior results in estimating the 
vessel radius compared to approaches based on a Gaussian model. Also, 
we point out general limitations in estimating the radius of thin vessels. 



1 Introduction 

Heart and vascular diseases are among the main causes of death for women 
and men in modern society. An abnormal narrowing of arteries (stenosis) caused 
by atherosclerosis is one of the main reasons for these diseases, as the essential 
blood flow is hindered. In particular, the blocking of a coronary artery can lead 
to a heart attack. In clinical practice, images of the human vascular system are 
acquired using different imaging modalities, for example, ultrasound, magnetic 
resonance angiography (MRA), X-ray angiography, or ultra-fast CT. Segmentation 
and quantification of vessels (e.g., estimation of the radius) from these 
images is crucial for diagnosis, treatment, and surgical planning. 

The segmentation of vessels from 3D medical images, however, is difficult and 
challenging. The main reasons are: 1) the thickness (radius) of vessels depends 
on the type of vessel (e.g., relatively small for coronary arteries and large for 
the aorta), 2) the thickness typically varies along the vessel, 3) the images are 
noisy and partially the boundaries between the vessels and surrounding tissues 
are difficult to recognize, and 4) in comparison to planar structures depicted in 
2D images, the segmentation of curved 3D structures from 3D images is much 
more difficult. Previous work on vessel segmentation from 3D image data can be 
divided into two main classes of approaches, one based on differential measures 
(e.g., Koller et al. [6], Krissian et al. [7], Bullitt et al. [2]) and the other based 
on deformable models (e.g., Rueckert et al. [10], Noordmans and Smeulders [8], 
Frangi et al. [3], Gong et al. [5]). For a model-based 2D approach for measuring 
intrathoracic airways see Reinhardt et al. [9]. The main disadvantage of differential 
measures is that only local image information is taken into account, and 
therefore these approaches are relatively sensitive to noise. On the other hand, 
approaches based on deformable models generally exploit contour information of 
the anatomical structures, often sections through vessel structures, i.e. circles or 
ellipses. While these approaches include more global information in comparison 
to differential approaches, only 2D or 3D contours are taken into account. 

Fig. 1. Intensity plots of 2D slices of a thin vessel in the pelvis (left), the artery iliaca 
communis of the pelvis (middle), and the aorta (right) in 3D MR images. 

We have developed a new 3D parametric intensity model for the segmenta- 
tion of vessels from 3D image data. This analytic model represents a cylindrical 
structure of variable radius and directly describes the image intensities of vessels 
and the surrounding tissue. In comparison to previous contour-based deformable 
models much more image information is taken into account which improves the 
robustness and accuracy of the segmentation result. In comparison to previ- 
ously proposed Gaussian shaped models (e.g., [8], [5]), the new model represents 
a Gaussian smoothed cylinder and yields superior results for vessels of small, 
medium, and large sizes. Moreover, the new model has a well defined radius. In 
contrast, for Gaussian shaped models the radius is often heuristically defined, 
e.g., using the inflection point of the Gaussian function. We report experiments 
of successfully applying the new model to segment vessels from 3D MRA images. 

2 3D Parametric Intensity Model for Tubular Structures 

2.1 Analytic Description of the Intensity Structure 

The intensities of vessels are often modeled by a 2D Gaussian function for a 
2D cross-section or by a 3D Gaussian line (i.e. a 2D Gaussian swept along the 
third dimension) for a 3D volume (e.g., [8], [7], [5]). However, the intensity profile 
of 2D cross-sections of medium and large vessels is plateau-like (see Fig. 1), 
which cannot be well modeled with a 2D Gaussian function. Therefore, to more 
accurately model vessels of small, medium, and large sizes, we propose to use 
a Gaussian smoothed 3D cylinder, specified by the radius $R$ (thickness) of the 
vessel segment and Gaussian smoothing $\sigma$. A 2D cross-section of this Gaussian 
smoothed 3D cylinder is defined as 

$$g_{Disk}(x, y, R, \sigma) = \mathrm{Disk}(x, y, R) \ast G^{2D}_{\sigma}(x, y) \qquad (1)$$

where $\ast$ denotes the 2D convolution, $\mathrm{Disk}(x, y, R)$ is a two-valued function with 
value 1 if $r < R$ and 0 otherwise (for $r = \sqrt{x^2 + y^2}$), and $G^{2D}_{\sigma}$ is the 2D Gaussian 
function $G^{2D}_{\sigma}(x, y) = G_{\sigma}(x)\, G_{\sigma}(y)$, where $G_{\sigma}(x) = (\sqrt{2\pi}\,\sigma)^{-1} e^{-x^2/(2\sigma^2)}$. By 
exploiting the symmetries of the disk and the 2D Gaussian function as well as 
the separability of the 2D convolution, we can rewrite (1) as 

$$g_{Disk}(x, y, R, \sigma) = 2 \int_{-R}^{R} G_{\sigma}(r - \eta)\, \Phi_{\sigma}\!\left( \sqrt{R^2 - \eta^2} \right) d\eta - \left( \Phi_{\sigma}(r + R) - \Phi_{\sigma}(r - R) \right) \qquad (2)$$

using the Gaussian error function $\Phi(x) = \int_{-\infty}^{x} (2\pi)^{-1/2} e^{-\xi^2/2}\, d\xi$ and $\Phi_{\sigma}(x) = 
\Phi(x/\sigma)$. Unfortunately, a closed form of the integral in (2) is not known. Therefore, 
the exact solution of a Gaussian smoothed cylinder cannot be expressed 
analytically and thus is computationally expensive. Fortunately, in [1] two approximations 
$g_{Disk<}$ and $g_{Disk>}$ of $g_{Disk}$ are given for the cases $R/\sigma < T_{\Phi}$ and 
$R/\sigma > T_{\Phi}$, respectively (using a threshold of $T_{\Phi} = 1$ to switch between the 
cases). Note that the two approximations are generally not continuous at the 
threshold value $T_{\Phi}$. However, for our model fitting approach a continuous and 
smooth model function is required (see Sect. 3 for details). Therefore, based on 
these two approximations, we have developed a combined model using a Gaussian 
error function as a blending function such that for all ratios $R/\sigma$ always the 
approximation with the lower approximation error is used. The blending function 
has two fixed parameters for controlling the blending effect, i.e. a threshold 
$T_{\Phi}$ which determines the ratio $R/\sigma$ where the approximations are switched and 
a standard deviation $\sigma_{\Phi}$ which controls the smoothness of switching. We determined 
optimal values for both blending parameters (see Sect. 2.2 for details). 
The 3D cylindrical model can then be written as (using $\mathbf{x} = (x, y, z)^T$) 



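$$g_{Cylinder}(\mathbf{x}, R, \sigma) = g_{Disk<}(r, R, \sigma)\, \Phi_{\sigma_{\Phi}}\!\left( T_{\Phi} - \frac{R}{\sigma} \right) + g_{Disk>}(r, R, \sigma)\, \Phi_{\sigma_{\Phi}}\!\left( \frac{R}{\sigma} - T_{\Phi} \right) \qquad (3)$$

where 

$$g_{Disk<}(r, R, \sigma) = \frac{2R^2}{4\sigma^2 + R^2}\, e^{-\frac{2r^2}{4\sigma^2 + R^2}} \qquad (4)$$

$$g_{Disk>}(r, R, \sigma) = \Phi\!\left( \frac{c_2 - 1}{c_1} + c_1 \right) \qquad (5)$$

$$c_1 = \frac{2}{3}\, \sigma\, \frac{\sqrt{\sigma^2 + x^2 + y^2}}{2\sigma^2 + x^2 + y^2} , \qquad c_2 = \left( \frac{R^2}{2\sigma^2 + x^2 + y^2} \right)^{1/3} \qquad (6)$$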
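
The quality of the combined model can be checked numerically. The following is a minimal sketch, not the authors' implementation: it evaluates the exact cross-section (2) with a simple midpoint rule and compares it with the blended model (3) built from (4) to (6). The function names, the number of integration steps, and the default blending parameters $T_{\Phi} = 1.72$ and $\sigma_{\Phi} = 0.1$ (the values quoted in Sect. 2.2) are choices made for this example.

```python
import math


def phi(x):
    """Gaussian error function Phi (standard normal cdf)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gauss(x, sigma):
    """1D Gaussian G_sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)


def g_disk_exact(r, R, sigma, steps=20000):
    """Eq. (2), integral evaluated with the midpoint rule (assumed step count)."""
    h = 2.0 * R / steps
    integral = 0.0
    for k in range(steps):
        eta = -R + (k + 0.5) * h
        integral += gauss(r - eta, sigma) * phi(math.sqrt(R * R - eta * eta) / sigma)
    integral *= h
    return 2.0 * integral - (phi((r + R) / sigma) - phi((r - R) / sigma))


def g_disk_small(r, R, sigma):
    """Eq. (4): approximation for small ratios R/sigma."""
    d = 4.0 * sigma ** 2 + R ** 2
    return 2.0 * R ** 2 / d * math.exp(-2.0 * r ** 2 / d)


def g_disk_large(r, R, sigma):
    """Eqs. (5) and (6): approximation for large ratios R/sigma."""
    q = 2.0 * sigma ** 2 + r ** 2
    c1 = (2.0 / 3.0) * sigma * math.sqrt(sigma ** 2 + r ** 2) / q
    c2 = (R ** 2 / q) ** (1.0 / 3.0)
    return phi((c2 - 1.0) / c1 + c1)


def g_cylinder(r, R, sigma, t_phi=1.72, sigma_phi=0.1):
    """Eq. (3): blend of both approximations with a Gaussian error function."""
    ratio = R / sigma
    return (g_disk_small(r, R, sigma) * phi((t_phi - ratio) / sigma_phi)
            + g_disk_large(r, R, sigma) * phi((ratio - t_phi) / sigma_phi))


for R in (1.0, 3.0):
    print(R, g_disk_exact(0.0, R, 1.0), g_cylinder(0.0, R, 1.0))
```

At the centre of the disk ($r = 0$) the exact value is known in closed form, $1 - e^{-R^2/(2\sigma^2)}$, which provides an independent check of the numerical integration.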



Fig. 2. For different ratios of $R/\sigma = 1.0; 3.0; 8.0$ (from left to right), the exact curve 
$g_{Disk}$ of a 1D cross-section of a Gaussian smoothed disk is given (grey curve) as well 
as the approximations $g_{Disk<}$ and $g_{Disk>}$ (dashed and dotted curve, respectively, for the 
negative axis) and the new model $g_{Cylinder}$ (dashed curve for the positive axis). 

Fig. 2 shows 1D cross-sections (for different ratios $R/\sigma$) of the exact Gaussian 
smoothed cylinder $g_{Disk}$ (numerically integrated), the two approximations 
$g_{Disk<}$ and $g_{Disk>}$, and our new model $g_{Cylinder}$. It can be seen that our model 
approximates the exact curve very well (see the positive axis). In addition, we 
include in our model the intensity levels $a_0$ (surrounding tissue) and $a_1$ (vessel) 
as well as a 3D rigid transform $\mathcal{R}$ with rotation parameters $\boldsymbol{\alpha} = (\alpha, \beta, \gamma)^T$ and 
translation parameters $\mathbf{t} = (x_0, y_0, z_0)^T$. This results in the parametric intensity 
model with a total of 10 parameters $\mathbf{p} = (R, a_0, a_1, \sigma, \alpha, \beta, \gamma, x_0, y_0, z_0)$: 

$$g_{M,Cylinder}(\mathbf{x}, \mathbf{p}) = a_0 + (a_1 - a_0)\, g_{Cylinder}(\mathcal{R}(\mathbf{x}, \boldsymbol{\alpha}, \mathbf{t}), R, \sigma) \qquad (7)$$



2.2 Optimal Values T# and er<p for the Blending Function 

In order to determine optimal values $T_\Phi$ and $\sigma_\Phi$ for the blending
function used in (3), we computed the approximation errors of the approximations
$g_{\mathrm{Disk}<}$ and $g_{\mathrm{Disk}>}$ for different values of
$\sigma = 0.38, 0.385, \ldots, 0.8$ and fixed radius $R = 1$ (note that we can
fix $R$, as only the ratio $R/\sigma$ is important). The approximation errors
were numerically integrated in 2D over one quadrant of the smoothed disk (using
Mathematica). From the results (see Fig. 3 left and middle) we found that the
approximation errors intersect at $\sigma/R = 0.555 \pm 0.005$ in the L1-norm
and at $\sigma/R = 0.605 \pm 0.005$ in the L2-norm. We chose the mean of both
intersection points as threshold, i.e. $T_\Phi = 1/0.58 \approx 1.72$. It is
worth mentioning that this value for $T_\Phi$ is much better than $T_\Phi = 1$
originally proposed in [1]. For $\sigma_\Phi$ we chose a value of 0.1. Further
experiments (not shown here) indicate that these settings give relatively small
approximation errors in both norms. Moreover (see Fig. 3 left and middle), our
model not only combines the more accurate parts of both approximations but also
has a lower error in the critical region close to $T_\Phi$, where both
approximations have their largest errors.
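
As a quick check of the threshold, the arithmetic can be written out as follows. This only restates the two intersection points reported in the text; the symbol $\overline{\sigma/R}$ for their mean is introduced here for illustration.

```latex
\overline{\sigma/R} = \frac{0.555 + 0.605}{2} = 0.58,
\qquad
T_\Phi = \frac{1}{\overline{\sigma/R}} = \frac{1}{0.58} \approx 1.72
```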

2.3 Analysis for Thin Structures 

For thin cylinders, i.e. $R/\sigma < T_\Phi$, our model $g_{\mathrm{Cylinder}}$
is basically the same as the approximation $g_{\mathrm{Disk}<}$, which has the
following remarkable property for some factor $f$ with
$0 < f \le f_{\max} = \sqrt{1 + 4\sigma^2/R^2}$:

$$a\, g_{\mathrm{Disk}<}(r, R, \sigma) = a'\, g_{\mathrm{Disk}<}\Big(r,\; R' = fR,\; \sigma' = \tfrac{1}{2}\sqrt{4\sigma^2 + R^2 (1 - f^2)}\Big) \qquad (8)$$

where $a$ represents the contrast $a_1 - a_0$ of our model
$g_{M,\mathrm{Cylinder}}$ and $a' = a/f^2$. This means that this function is
identical for different values of $f$, i.e. different settings of $R'(f)$,
$\sigma'(f)$, and $a'(f)$ generate the same intensity structure. This relation
is illustrated for one example in Fig. 3 (right). As a consequence, based on
this approximation it is not possible to unambiguously estimate $R$, $\sigma$,
and $a$ from intensities representing a thin smoothed cylinder. In order to
uniquely estimate the parameters we need additional information, i.e. a priori
knowledge of one of the three parameters. With this information and the
ambiguous estimates we are able to compute $f$ and subsequently the remaining
two parameters.

Fig. 3. For different values of $\sigma = 0.38, 0.385, \ldots, 0.8$ and radius
$R = 1$, the errors of the approximations $g_{\mathrm{Disk}<}$ and
$g_{\mathrm{Disk}>}$ (dark and light gray, respectively) as well as the error of
the new model $g_{\mathrm{Cylinder}}$ (black) are shown for the L1-norm (left)
and L2-norm (middle). The right diagram shows $R'(f)$, $\sigma'(f)$, and $a'(f)$
for a varying factor $f$ between 0 and $f_{\max}$ (for fixed $R = 0.5$,
$\sigma = 1$, $a = 1$). The vertical dashed line indicates the ratio
$f = R/\sigma = T_\Phi$, i.e. only the left part of the diagram is relevant for
$g_{\mathrm{Disk}<}$.

Obviously, it is unlikely that we have a priori knowledge about the radius of
the vessel, as the estimation of the radius is our primary task. On the other
hand, even relatively accurate information about the smoothing parameter
$\sigma$ will not help us much, as can be seen from (8) and also Fig. 3 (right):
$\sigma'(f)$ does not change much in the relevant range of $f$. Therefore, a
small deviation in $\sigma$ can result in a large deviation of $f$ and thus
gives an unreliable estimate for $R$. Fortunately, the opposite is the case for
the contrast $a'(f)$. For given estimates $\hat{R}$ and $\hat{a}$ as well as a
priori knowledge about $a$, we can compute $f = \sqrt{a/\hat{a}}$ and
$R = \hat{R}/f = \hat{R}\sqrt{\hat{a}/a}$. For example, for an uncertainty of
±10% in the true contrast $a$ the computed radius is only affected by ca. ±5%,
and for an uncertainty of -30% to +56% the computed radius is affected by less
than 20%. Note that this consideration only affects thin vessels with a ratio
$R/\sigma < T_\Phi = 1.72$, i.e. for typical values of $\sigma \approx 1$ voxel
and thus a radius below 2 voxels, the error in estimating the radius is below
0.2 voxels even for a large uncertainty of -30% to +56%.
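
The computation described above can be sketched in a few lines of Python. This is a minimal illustration that assumes only the relations $R' = fR$ and $a' = a/f^2$ from (8); the function names `radius_from_contrast_prior` and `relative_radius_error` are hypothetical and not part of the original work.

```python
import math

def radius_from_contrast_prior(r_est, a_est, a_prior):
    """Resolve the thin-vessel ambiguity using a known contrast.

    r_est, a_est: ambiguous estimates (R', a') returned by the model fit.
    a_prior:      a priori contrast a (assumed known, e.g. from a thicker part).
    Returns (f, R) with f = sqrt(a / a') and R = R' / f.
    """
    f = math.sqrt(a_prior / a_est)
    return f, r_est / f

def relative_radius_error(contrast_error):
    """Relative error of the computed radius if the assumed contrast is off by
    `contrast_error` (e.g. 0.10 for +10%); R scales with 1 / sqrt(a)."""
    return 1.0 / math.sqrt(1.0 + contrast_error) - 1.0

if __name__ == "__main__":
    for err in (-0.30, -0.10, 0.10, 0.56):
        print(f"contrast error {err:+.0%} -> "
              f"radius error {relative_radius_error(err):+.1%}")
```

Because the radius depends on the contrast only through a square root, a given relative error in the assumed contrast is roughly halved in the computed radius, which is consistent with the percentages quoted in the text.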

We propose two strategies for determining a. In case we are segmenting a 
vessel with varying radius along the vessel, we can use the estimate of the contrast 
in parts of the vessel where $R/\sigma > T_\Phi$ (here the estimates of the parameters are
unique) for the other parts as well. In case of a thin vessel without thicker parts 
we could additionally segment a larger close-by vessel for estimating the contrast, 
assuming that the contrast is similar in this region of the image. 

Standard approaches for vessel segmentation based on a Gaussian function
(e.g., [8], [7], [5]) only estimate two parameters: the image contrast $a_g$ and
a standard deviation $\sigma_g$. Assuming that the image intensities are
generated by a Gaussian smoothed cylinder based on $g_{\mathrm{Disk}<}$, we can
write $a_g = 2aR^2/(4\sigma^2 + R^2)$ and $\sigma_g = \sqrt{4\sigma^2 + R^2}/2$,
see (4). Often, the radius of the vessel is defined by the estimated standard
deviation $\sigma_g$, which implies that $\sigma = R\sqrt{3}/2$ holds. However,
this is generally not the case and therefore leads to inaccurate estimates of
$R$.

Fig. 4. Estimated radius $R$ for 102 segments of a smoothed straight 3D cylinder
with settings $R = 2$, $\sigma = 1$, $a_0 = 50$, and $a_1 = 150$ as well as
added Gaussian noise ($\sigma_n = 10$). In addition, one 2D cross-section of the
3D synthetic data is shown.
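
The size of this bias can be made concrete with a short sketch. It uses only the two relations for $a_g$ and $\sigma_g$ quoted above; the helper names are hypothetical, and inverting $\sigma_g$ for $R$ assumes that $\sigma$ is known, which the text notes is usually not the case.

```python
import math

def gaussian_line_parameters(radius, sigma, contrast):
    """Amplitude a_g and standard deviation sigma_g of the Gaussian that the
    thin-vessel approximation reduces to, using the relations in the text."""
    denom = 4.0 * sigma ** 2 + radius ** 2
    a_g = 2.0 * contrast * radius ** 2 / denom
    sigma_g = math.sqrt(denom) / 2.0
    return a_g, sigma_g

def radius_from_sigma_g(sigma_g, sigma):
    """Invert sigma_g = sqrt(4 sigma^2 + R^2) / 2 for R, given a known sigma.
    Returns None if sigma_g is too small to be consistent with sigma."""
    arg = 4.0 * (sigma_g ** 2 - sigma ** 2)
    return math.sqrt(arg) if arg >= 0.0 else None

if __name__ == "__main__":
    for radius in (0.5, 1.0, 1.5):
        _, sigma_g = gaussian_line_parameters(radius, sigma=1.0, contrast=100.0)
        print(f"R = {radius}: sigma_g = {sigma_g:.3f}")
```

For $\sigma = 1$ the value of $\sigma_g$ stays close to 1 over this range of radii, which illustrates why equating the radius with $\sigma_g$ is unreliable for thin vessels.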

3 Incremental Vessel Segmentation and Quantification 

To segment a vessel we utilize an incremental process which starts from a given 
point of the vessel and proceeds along the vessel. In each increment, the parame- 
ters of the cylinder segment are determined by fitting the cylindrical model in (7) 
to the image intensities g(x) within a region-of-interest (ROI), thus minimizing 

$$\sum_{\mathbf{x} \in \mathrm{ROI}} \big(g_{M,\mathrm{Cylinder}}(\mathbf{x}, \mathbf{p}) - g(\mathbf{x})\big)^2 \qquad (9)$$

For the minimization we apply the method of Levenberg-Marquardt, incorpo- 
rating 1st order partial derivatives of the cylindrical model w.r.t. the model 
parameters. The partial derivatives can be derived analytically. The length of 
the cylinder segment is defined by the ROI size (in our case typically 9-21 voxels).
Initial parameters for the fitting process are determined from the estimated
parameters of the previous segment using a linear Kalman filter, thus the incre- 
mental scheme adjusts for varying thickness and changing direction. 
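
The fitting step can be illustrated with a deliberately reduced example. The sketch below is not the authors' implementation: instead of the 10-parameter cylindrical model it fits a two-parameter Gaussian profile (an assumed stand-in), but it follows the same recipe of minimizing a sum of squared differences with Levenberg-Marquardt and analytic first-order partial derivatives. All function names are hypothetical.

```python
import math

def model(r, amp, s):
    """Illustrative Gaussian profile; a stand-in for the cylindrical model."""
    return amp * math.exp(-r * r / (2.0 * s * s))

def partials(r, amp, s):
    """Analytic first-order partial derivatives of model() w.r.t. (amp, s)."""
    e = math.exp(-r * r / (2.0 * s * s))
    return e, amp * e * r * r / s ** 3

def cost(samples, amp, s):
    """Sum of squared differences between model and sampled intensities."""
    return sum((model(r, amp, s) - g) ** 2 for r, g in samples)

def fit(samples, amp, s, iterations=100, lam=1e-3, tol=1e-18):
    """Levenberg-Marquardt iteration for the two parameters (amp, s)."""
    current = cost(samples, amp, s)
    for _ in range(iterations):
        if current < tol:
            break
        # Accumulate J^T J (2x2, symmetric) and J^T residual.
        h11 = h12 = h22 = g1 = g2 = 0.0
        for r, g in samples:
            res = model(r, amp, s) - g
            j1, j2 = partials(r, amp, s)
            h11 += j1 * j1
            h12 += j1 * j2
            h22 += j2 * j2
            g1 += j1 * res
            g2 += j2 * res
        # Damped normal equations: (J^T J + lam * diag(J^T J)) d = -J^T res.
        a11 = h11 * (1.0 + lam)
        a22 = h22 * (1.0 + lam)
        det = a11 * a22 - h12 * h12
        if det <= 0.0:
            break
        d_amp = -(a22 * g1 - h12 * g2) / det
        d_s = -(a11 * g2 - h12 * g1) / det
        if s + d_s > 0.0:
            trial = cost(samples, amp + d_amp, s + d_s)
        else:
            trial = math.inf  # reject steps that make the width non-positive
        if trial < current:
            amp, s, current = amp + d_amp, s + d_s, trial
            lam *= 0.1   # step accepted: move towards Gauss-Newton
        else:
            lam *= 10.0  # step rejected: move towards gradient descent
    return amp, s

if __name__ == "__main__":
    truth = (100.0, 2.0)
    samples = [(r, model(r, *truth)) for r in range(-10, 11)]
    print(fit(samples, amp=80.0, s=1.5))
```

In the incremental scheme described above, the starting values passed to such a fit would come from the prediction for the next segment rather than from a fixed guess.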

4 Experimental Results 

4.1 3D Synthetic Data 

In total we have generated 388 synthetic 3D images of straight and curved
tubular structures using Gaussian smoothed discrete cylinders and spirals (with
different parameter settings, e.g., for the cylinders we used radii of
$R = 1, \ldots, 9$ voxels, smoothing values of $\sigma = 0.5, 0.75, \ldots, 2$
voxels, and a contrast of 100 grey levels). We also added Gaussian noise
($\sigma_n = 0, 1, 3, 5, 10, 20$ grey levels). From the experiments we found
that the approach is quite robust against noise and produces accurate results in
estimating the radius as well as the other model parameters (i.e. contrast and
image smoothing as well as 3D position and orientation). As an example, Fig. 4
shows the estimated radius for 102 segments of a relatively thin smoothed
cylinder. The correct radius could be estimated quite accurately within ±0.06
voxels along the whole cylinder. Fig. 5 (right) shows the differences of the
estimated radius to the true radius of smoothed cylinders for a range of
different radii (for $\sigma = 1$ and $\sigma_n = 10$). It can be seen that the
error in the estimated radius is in all cases well below 0.1 voxels. As a
comparison we also applied a standard approach based on a 3D Gaussian line. To
cope with the general limitations of the Gaussian line approach (see Sect. 2.3),
we additionally calibrated the estimated radius (assuming an image smoothing of
$\sigma = 1$, see [5] for details). It can be seen that the new approach yields
a significantly more accurate result in comparison to both the uncalibrated and
calibrated Gaussian line approach (Fig. 5 left and middle). Fig. 6 shows
segmentation results of our new approach for a spiral and a screw-like spiral
(for a radius of $R = 2$ voxels). It turns out that our new approach accurately
segments curved structures of varying curvature, i.e. the estimated radius is
within ±0.1 voxels of the true radius for nearly all parts of the spirals.
Larger errors only occur for the last part of the innermost winding, where the
curvature is relatively large.

Fig. 5. Differences of the estimated radius (mean over ca. 99 segments) and the
true radius for a synthetic straight cylinder with different radii
$R = 1, \ldots, 9$ for the uncalibrated (left) and calibrated Gaussian line
model (middle), as well as for the new cylindrical model (right). The dashed
lines highlight the interval from -0.1 to 0.1 voxels.

Fig. 6. Segmentation results of applying the cylindrical model to 3D synthetic
data of a spiral (left) and a screw-like spiral (right). For visualization we
used 3D Slicer [4].

Fig. 7. Segmentation results of applying the new cylindrical model to arteries
of the pelvis (left and middle) as well as to coronary arteries and the aorta
(right).

4.2 3D Medical Images 

With our approach both position and shape information (radius) are estimated 
from 3D images. Fig. 7 shows segmentation results of applying the new cylin- 
drical model to 3D MRA images of the human pelvis and heart. Note that for 
the segmentation of the vessel trees we used starting points at each bifurcation. 
It can be seen that arteries of quite different sizes and high curvatures are
successfully segmented. As a typical example, the computation time for segmenting
an artery of the pelvis (see Fig. 7 left, main artery in left branch including
the upper part) using a radius of the ROI of 10 voxels is just under 4 min for a
total of 760 segments (on an AMD Athlon PC with 1.7 GHz, running Linux).

5 Discussion 

The new 3D cylindrical intensity model yields accurate and robust segmentation
results comprising both position and thickness information. The model allows
accurate segmentation of 3D vessels over a large spectrum of sizes, i.e. from
very thin vessels (e.g., a radius of only 1 voxel) up to relatively large
arteries (e.g., a radius of 14 voxels for the aorta). Also, we pointed out
general limitations in the case of thin structures and disadvantages of
approaches based on a Gaussian function.
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Abstract. Image segmentation based on graph representations has been 
a very active field of research recently. One major reason is that pairwise 
similarities (encoded by a graph) are also applicable in general situations 
where prototypical image descriptors as partitioning cues are no longer 
adequate. In this context, we recently proposed a novel convex program- 
ming approach for segmentation in terms of optimal graph cuts which 
compares favorably with alternative methods in several aspects. 

In this paper we present a fully elaborated version of this approach along 
several directions: first, an image preprocessing method is proposed to 
reduce the problem size by several orders of magnitude. Furthermore, 
we argue that the hierarchical partition tree is a natural data structure 
as opposed to enforcing multiway cuts directly. In this context, we ad- 
dress various aspects regarding the fully automatic computation of the 
final segmentation. Experimental results illustrate the encouraging per- 
formance of our approach for unsupervised image segmentation. 



1 Introduction 

The segmentation of images into coherent parts is a key problem of computer 
vision. It is widely agreed that in order to properly solve this problem, both 
data-driven and model-driven approaches have to be taken into account [1]. 

Concerning the data-driven part, graph-theoretical approaches are better suited
for unsupervised segmentation than approaches working in Euclidean spaces:
as opposed to representations based on (dis-)similarity relations, class
representations based on Euclidean distances (and variants) are too restrictive
to capture signal variability in low-level vision [2]. This claim also appears
to be supported by research on human perception [3].

The unsupervised partitioning of graphs constitutes a difficult combinatorial
optimization problem. Suitable problem relaxations like the mean-field
approximation [4,5] or spectral relaxation [6,7] are necessary to find a
compromise between computational complexity and quality of approximate
solutions.

Recently, a novel convex programming approach utilizing semidefinite relaxation
has been shown to be superior regarding optimization quality, the absence of
heuristic tuning parameters, and the possibility to mathematically constrain
segmentations, at the cost of an increased but still moderate polynomial
computational complexity [8]. This motivates elaborating this approach towards
a fully automatic and efficient unsupervised segmentation scheme providing a
hierarchical data structure of coherent image parts which, in combination with
model-based processing, may be explored for the purpose of scene interpretation
(see Fig. 1 for an example result).

Fig. 1. A color image from the Berkeley segmentation dataset [9] (left).
Comparing the segmentation boundaries calculated with the semidefinite
programming relaxation (right) to the human segmentations (middle), the high
quality of the SDP relaxation result is reflected by a high F-measure (see
Section 5) of 0.92.

To this end, we consider a hierarchical framework for the binary partitioning 
approach presented in [8] to obtain a segmentation into multiple clusters (Section 
2). To reduce the problem size by several orders of magnitude (to less than 
0.01% of the all-pixel-based graph), we discuss an over-segmentation technique 
[10] which forms coherent “superpixels” [11] in a preprocessing step (Section 3). 
Section 4 treats various aspects concerning the development of a fully automatic 
unsupervised segmentation scheme. Experimental results based on a benchmark 
dataset of real world scenes [9] and comparisons with the normalized cut criterion 
illustrate the encouraging performance of our approach (Section 5). 



2 Image Segmentation via Graph Cuts 

The problem of image segmentation based on pairwise affinities can be formulated
as a graph partitioning problem in the following way: consider the weighted
graph $G(V, E)$ with locally extracted image features as vertices $V$ and
pairwise similarity values $w_{ij} \in \mathbb{R}_0^+$ as edge-weights.
Segmenting the image into two parts then corresponds to partitioning the nodes
of the graph into disjoint groups $S$ and $\bar{S} = V \setminus S$.
Representing such a partition by an indicator vector $x \in \{-1, +1\}^n$ (where
$n = |V|$), the quality of a binary segmentation can be measured by the weight
of the corresponding cut in the graph:
$\mathrm{cut}(S, \bar{S}) = \sum_{i \in S,\, j \in \bar{S}} w_{ij} = \tfrac{1}{4} x^\top L x$,
where $L = D - W$ denotes the graph Laplacian matrix, and $D$ is the diagonal
degree matrix with $D_{ii} = \sum_{j \in V} w_{ij}$.
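
The identity between the cut weight and the quadratic form can be checked numerically on a toy graph. The following sketch is purely illustrative; the 4-vertex weight matrix and the helper names are made up for this example.

```python
def laplacian(w):
    """Graph Laplacian L = D - W for a symmetric weight matrix (list of lists)."""
    n = len(w)
    return [[(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(n)]
            for i in range(n)]

def cut_weight(w, x):
    """Sum of the weights of all edges joining the +1 group and the -1 group."""
    n = len(w)
    return sum(w[i][j] for i in range(n) for j in range(n)
               if x[i] > 0 and x[j] < 0)

def quadratic_form(m, x):
    """Evaluate x^T M x for a square matrix M."""
    n = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n))

if __name__ == "__main__":
    # Hypothetical similarity weights between four image features.
    w = [[0, 3, 1, 0],
         [3, 0, 0, 1],
         [1, 0, 0, 2],
         [0, 1, 2, 0]]
    x = [1, 1, -1, -1]  # indicator vector: {0, 1} versus {2, 3}
    print(cut_weight(w, x), quadratic_form(laplacian(w), x) / 4)
```

The factor 1/4 arises because every cut edge contributes $(x_i - x_j)^2 = 4$ to $x^\top L x$, while edges inside a group contribute nothing.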

As directly minimizing the cut favors unbalanced segmentations, several 
methods for defining more suitable measures have been suggested in the lit- 
erature. One of the most popular is the normalized cut criterion [7], which tries 
to avoid unbalanced partitions by appropriately scaling the cut-value. Since the 
corresponding cost function yields an NP-hard minimization problem, a spectral 
relaxation method is used to compute an approximate solution which is based on 
calculating minimal eigenvectors of the normalized Laplacian $L' = D^{-1/2} L D^{-1/2}$.
To get a binary solution of the original problem, these eigenvectors are then
thresholded appropriately.

SDP relaxation. In this paper, we employ an alternative technique to find
balanced partitions which originates from spectral graph theory [6]. As a
starting point consider the following combinatorial problem formulation:

$$\min_{x \in \{-1,+1\}^n} x^\top L x \quad \text{s.t.} \quad c^\top x = b. \qquad (1)$$

Thus, instead of normalizing the cut-value as in [7], in this case an additional
balancing constraint $c^\top x = b$ is used to compute favorable partitions. A
classical approach to find a balanced segmentation uses $c = (1, \ldots, 1)^\top$
and $b = 0$, which is reasonable for graphs where each vertex is equally
important. However, this may not be the case for the preprocessed images
considered here; we will therefore discuss alternative settings for $c$ and $b$
in Section 4.

In order to find an approximate solution for the NP-hard problem (1), an
advanced method is proposed in [8] which, in contrast to spectral relaxation, is
not only able to handle the general linear constraint, but also takes into
account the integer constraint on $x$ in a better way. Observing that the
cut-weight can be rewritten as $x^\top L x = \mathrm{tr}(L x x^\top)$, the
problem variables are lifted into a higher dimensional space by introducing the
matrix variable $X = x x^\top$. Dropping the rank one constraint on $X$ and
using arbitrary positive semidefinite matrices $X \succeq 0$ instead, we obtain
the following relaxation of (1):

$$\min_{X \succeq 0} \mathrm{tr}(LX) \quad \text{s.t.} \quad \mathrm{tr}(c c^\top X) = b^2, \quad \mathrm{tr}(e_i e_i^\top X) = 1 \ \text{ for } i = 1, \ldots, n, \qquad (2)$$

where $e_i \in \mathbb{R}^n$ denotes the $i$-th unit vector (see [8] for
details).

The important point is that (2) belongs to the class of semidefinite programs
(SDP), which can be solved in polynomial time to arbitrary precision, without
needing to adjust any additional tuning parameters (see, e.g., [12]). To finally
recover an integer solution $x$ from the computed solution matrix $X$ of (2), we
use a randomized approximation technique [13]. Since this method does not
enforce the balancing constraint from (1), the constraint rather serves as a
strong bias to guide the search instead of a strict requirement (cf. [8]).
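
The correspondence between the constraints of (1) and those of (2) can be written out explicitly. The following lines are a short derivation sketch using only the substitution $X = x x^\top$ and the cyclic property of the trace; they add nothing beyond what the lifting already implies.

```latex
\begin{aligned}
x^\top L x &= \operatorname{tr}(x^\top L x) = \operatorname{tr}(L\, x x^\top) = \operatorname{tr}(L X), \\
(c^\top x)^2 &= x^\top c\, c^\top x = \operatorname{tr}(c c^\top X) = b^2, \\
x_i^2 &= e_i^\top x x^\top e_i = \operatorname{tr}(e_i e_i^\top X) = 1, \qquad i = 1, \ldots, n.
\end{aligned}
```

Note that the lifted balancing constraint only fixes $(c^\top x)^2$, so the sign of $b$ is not represented in (2).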

Hierarchical clustering. In order to find segmentations of the image into
multiple parts, we employ a hierarchical framework (e.g. [14]). In contrast to
direct multiclass techniques (cf. [15,16]), the original cost function is used
throughout the segmentation process, but for different (and usually smaller)
problems in each step. As a consequence, the number $k$ of segments does not
need to be defined in advance, but can be chosen during the computation (which
is more feasible for unsupervised segmentation tasks). Moreover, the subsequent
splitting of segments yields a hierarchy of segmentations, so that changing $k$
leads to similar segmentations. However, as no global cost function is
optimized, additional decision criteria are needed concerning the selection of
the next partitioning step and when to stop the hierarchical process. We will
consider such criteria in Section 4.

Fig. 2. 304 image patches are obtained for the image from Fig. 1 by
over-segmenting it with mean shift. Note that in accordance with the homogeneous
regions of the image, the patches differ in size. In this way, the splitting of
such regions during the hierarchical graph cut segmentation is efficiently
prevented.

3 Reducing the Problem Size 

One important issue for segmentation methods based on graph representations
is the size of the corresponding similarity matrix. If the vertex set $V$
contains the pixels of an image, the size of the similarity matrix is equal to
the squared number of pixels, and therefore generally too large to fit into
computer memory completely (e.g. for an image of 481 × 321 pixels, the size of
the images from the Berkeley segmentation dataset [9], the similarity matrix
contains $154401^2 \approx 23.8$ billion entries). As reverting to sparse
matrices (which works efficiently for spectral methods) is of no avail for the
SDP relaxation approach, we suggest reducing the problem size in a preprocessing
step. While in this context, approaches based on probabilistic sampling have
recently been applied successfully to image segmentation problems [17,18], we
propose a different technique.
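
The size argument is easy to reproduce. The sketch below recomputes the entry count quoted above; the memory estimate assumes dense storage with 8 bytes per entry, and the patch count of 700 is taken from the range of 100-700 patches reported for the preprocessed images. Both choices are illustrative assumptions.

```python
def similarity_matrix_entries(num_vertices):
    """Number of entries of a dense n x n similarity matrix."""
    return num_vertices ** 2

pixels = 481 * 321
dense_entries = similarity_matrix_entries(pixels)

# Memory for a dense matrix of 8-byte floats, in GiB (assumed storage format).
dense_gib = dense_entries * 8 / 2 ** 30

# Relative size of the matrix for a graph built on 700 patches instead.
patch_entries = similarity_matrix_entries(700)
entry_ratio = patch_entries / dense_entries

if __name__ == "__main__":
    print(f"{pixels} pixels -> {dense_entries} entries, about {dense_gib:.0f} GiB")
    print(f"700 patches -> {patch_entries} entries ({entry_ratio:.4%} of the dense matrix)")
```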

Over-segmentation with mean shift. Our method is based on the 
straightforward idea to abandon pixels as graph vertices and to use small image 
patches (or “superpixels” [11]) of coherent structure instead. In fact, it can be 
argued that this is even a more natural image representation than pixels as those 
are merely the result of the digital image discretization. The real world does not 
consist of pixels! 

In principle, any unsupervised clustering technique could be used as a pre- 
processing step to obtain such image patches of coherent structure. We apply 
the mean shift procedure [10], as it does not smooth over clear edges and results 
in patches of varying size (see Fig. 2 for an example). In this way, the important 
structures of the image are maintained, while on the other hand the number of 
image features for the graph representation is greatly reduced. 

In summary, the mean shift uses gradient estimation to iteratively seek modes 
of a density distribution in some Euclidean feature space. In our case, the feature 
vectors comprise the pixel positions along with their color in the perceptually 
uniform L*u*v* space. The number and size of the image patches are controlled by
scaling the entries of the feature vectors with the spatial and the range bandwidth
parameters $h_s$ and $h_r$, respectively (see [10] for details).

In order to get an adequate problem size for the SDP relaxation approach,
we determine these parameters semi-automatically: while the spatial bandwidth
$h_s$ is set to a fixed fraction of the image size, we calculate the range
bandwidth $h_r$ by randomly picking a certain number of pixels from the image,
computing their maximum distance $d_{\max}$ in the L*u*v* color space, and
setting $h_r$ to a fraction of $d_{\max}$. Moreover, we fix the minimum size of
a region to $M = 50$ pixels. For the images from the Berkeley dataset [9],
experiments showed that setting $h_s = 5.0$ and $h_r$ = [value illegible in
source] results in an appropriate number of 100-700 image patches (corresponding
to less than 0.5% of the number of pixels, and hence to less than 0.01% of the
entries of the pixel-based similarity matrix).

Constructing the graph. Using the image patches obtained with mean shift as
graph vertices, the corresponding affinities are defined by representing each
patch $i$ with its mean color $y_i$ in L*u*v* space, and calculating the
similarity weights $w_{ij}$ between neighboring patches as
$w_{ij} = l_{ij} \exp(\ldots)$ (the argument of the exponential is illegible in
the source), where $l_{ij}$ denotes the length of the edge in the image between
the patches $i$ and $j$. Hence, the problem is represented by a locally
connected graph.

Assuming that each pixel originally is connected to its four neighbors, the
multiplication with $l_{ij}$ simulates a standard coarsening technique for graph
partitioning [14]: the weight between two patches is calculated as the sum of
the weights between the pixels contained within these patches. As each patch is
of largely homogeneous color, using the mean color $y_i$ instead of exact pixel
colors does not change the resulting weights considerably.
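
The coarsening idea can be sketched as follows. The boundary length $l_{ij}$ is counted as the number of 4-neighbour pixel pairs that straddle two patches, as described above. The similarity term is an assumption: since the exact exponent is not recoverable from the text, a Gaussian of the squared mean-color distance with a hypothetical `scale` parameter is used as a placeholder.

```python
import math
from collections import defaultdict

def boundary_lengths(labels):
    """Count 4-neighbour pixel pairs that straddle two different patches.

    labels: 2D list of patch indices, one per pixel.
    Returns {(i, j): l_ij} with i < j.
    """
    lengths = defaultdict(int)
    rows, cols = len(labels), len(labels[0])
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):  # right and lower neighbour
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and labels[r][c] != labels[rr][cc]:
                    i, j = sorted((labels[r][c], labels[rr][cc]))
                    lengths[(i, j)] += 1
    return dict(lengths)

def patch_weights(labels, mean_color, scale):
    """w_ij = l_ij * exp(-||y_i - y_j||^2 / scale).

    The exponent is an assumed placeholder, not the formula from the paper.
    """
    weights = {}
    for (i, j), l_ij in boundary_lengths(labels).items():
        d2 = sum((a - b) ** 2 for a, b in zip(mean_color[i], mean_color[j]))
        weights[(i, j)] = l_ij * math.exp(-d2 / scale)
    return weights

if __name__ == "__main__":
    labels = [[0, 0, 1],
              [0, 2, 1],
              [2, 2, 1]]
    colors = {0: (50.0, 0.0, 0.0), 1: (52.0, 1.0, 0.0), 2: (80.0, 10.0, 5.0)}
    print(patch_weights(labels, colors, scale=100.0))
```

Only patches that share a boundary receive a weight, which yields the locally connected graph mentioned in the text.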

Note that additional cues like texture or intervening contours can be incor- 
porated into the classification process by computing corresponding similarity 
values based on the image patches, and combining them appropriately (see e.g. 
[14,19]). However, we do not consider modified similarities here. 



4 Towards a Fully Automatic Segmentation 

Based on the image patches obtained with mean shift, the SDP relaxation approach
is applied hierarchically to successively find binary segmentations. While
solving the relaxation itself does not require tuning any parameters, the
hierarchical application requires strategies for building up the segmentation
tree, which are the subject of this section.

Segmentation constraints. Concerning the balancing constraint $c^\top x = b$
in (1), the graph vertices represented by the entries of $c$ now correspond to
image patches of varying size. For this reason, we calculate the number of
pixels $m_i$ contained in each patch $i$ and set $c_i = m_i$ instead of
$c_i = 1$, while retaining $b = 0$. In this way, the SDP relaxation seeks two
coherent parts, each containing approximately the same number of pixels.

However, if the part of the image under consideration in the current step
contains a dominating patch $k$ with $c_k = \max_i c_i \gg c_j$ for all
$j \neq k$, segmentation into equally sized parts may not be possible.
Nevertheless, we can still produce a feasible instance of the SDP relaxation in
this case by adjusting the value of $b$ in (1) appropriately, e.g. to
$b = c_k - \sum_{i \neq k} c_i$ (a possible factor in front of the sum is
illegible in the source). Note that such an adjustment is not possible for
spectral relaxation methods!
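
A small sketch may help to see when the balanced setting $b = 0$ cannot be met. The criterion used in `has_dominating_patch` (one patch containing more pixels than all others together) is an assumption made for this illustration; the text itself only characterizes a dominating patch informally. All names are hypothetical.

```python
def balancing_vector(patch_sizes):
    """c_i = m_i: weight every patch by its number of pixels."""
    return list(patch_sizes)

def has_dominating_patch(c):
    """True if one patch holds more pixels than all others together, in which
    case c^T x = 0 has no solution x in {-1, +1}^n (assumed criterion)."""
    return 2 * max(c) > sum(c)

def imbalance(c, x):
    """c^T x: pixel count of the +1 part minus pixel count of the -1 part."""
    return sum(ci * xi for ci, xi in zip(c, x))

if __name__ == "__main__":
    c = balancing_vector([120, 30, 25, 20])
    print(has_dominating_patch(c))        # patch 0 outweighs the rest
    print(imbalance(c, [1, -1, -1, -1]))  # most balanced split still available
```

In such a case the most balanced partition isolates the dominating patch, and its imbalance gives a natural reference value when choosing a nonzero $b$.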

Which segment to split next? This question arises after each binary partitioning
step. As the goal of unsupervised image segmentation mainly consists in
capturing the global impression of the scene, large parts of coherent structure
should always be preferred to finer details. For this reason, we generally
select the largest existing segment as the next candidate to be split.

However, we allow for two exceptions to this general rule: (1) If the candidate
segment contains less than a certain number of patches (which we set to 8 in our
experiments), it is not split any further. This prevents dividing the image into
too much detail. (2) If the cut-value obtained for the candidate segment is too
large, this split is rejected, since this indicates that the structure of this
segment is already quite coherent. To decide when a cut-value $z$ is too large,
we compare it against the sum of all edge-weights $w'$ (which is an upper bound
on $z$): only if $z$ is smaller than 2% of $w'$, the corresponding split is
accepted.
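
The selection rule and its two exceptions translate directly into code. The sketch below is a hypothetical rendering: the data layout, the function names, and the choice to measure the "largest" segment by pixel count are assumptions, while the thresholds of 8 patches and 2% are the values stated in the text.

```python
MIN_PATCHES = 8        # segments with fewer patches are never split
MAX_CUT_RATIO = 0.02   # accept a split only if z < 2% of w'

def next_candidate(segments, finished):
    """Pick the largest segment (by pixel count) not marked as finished.

    segments: dict name -> {"pixels": int, "patches": int}
    finished: set of segment names that must not be split any further.
    """
    open_segments = [s for s in segments if s not in finished]
    if not open_segments:
        return None
    return max(open_segments, key=lambda s: segments[s]["pixels"])

def may_split(num_patches, cut_value, total_edge_weight):
    """Apply the two exceptions described in the text to a candidate split."""
    if num_patches < MIN_PATCHES:
        return False
    return cut_value < MAX_CUT_RATIO * total_edge_weight

if __name__ == "__main__":
    segments = {"water": {"pixels": 90000, "patches": 120},
                "sky": {"pixels": 50000, "patches": 40},
                "surfer": {"pixels": 14401, "patches": 6}}
    print(next_candidate(segments, finished={"water"}))
    print(may_split(40, cut_value=1.2, total_edge_weight=100.0))
```

A rejected candidate would be added to `finished`, so that the next call moves on to the next largest segment.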

Stopping criteria. Probably the most difficult question in connection with
unsupervised image segmentation concerns the number of parts the image consists
of, or equivalently, when to stop the hierarchical segmentation process. As
every human is likely to answer this question differently, one could even argue
that without defining the desired granularity, image segmentation becomes an
ill-posed problem. The hierarchical SDP relaxation approach offers two possible
stopping criteria based on the desired granularity: the first one directly
defines the maximum number of parts for the final segmentation. The second one
is based on the fact that adding the cut-values results in an increasing
function depending on the step number, which is bounded above by $w'$.
Therefore, we can introduce the additional criterion to stop the hierarchical
segmentation process when the complete cut value becomes larger than a certain
percentage of $w'$.

5 Experimental Results 

To evaluate the performance of our hierarchical segmentation algorithm, we apply
it to images from the Berkeley segmentation dataset [9], which contains
images of a wide variety of natural scenes. Moreover, this dataset also provides
“ground-truth” data in the form of segmentations produced by humans (cf. Fig.
1), which allows us to measure the performance of our algorithm quantitatively.
Some exemplary results are depicted in Fig. 3. These encouraging segmentations
are computed in less than 5 minutes on a Pentium 2 GHz processor.

As a quantitative measure of the segmentation quality, we use the precision- 
recall framework presented in [19]. In this context, the so-called F-measure is a 
valuable statistical performance indicator of a segmentation that captures the 
trade-off between accuracy and noise by giving values between 0 (bad segmen- 
tation) and 1 (good segmentation). For the results shown in Fig. 3, the corre- 
sponding F-measures confirm the positive visual impression. 
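
For reference, the F-measure can be computed as sketched below. This assumes the common balanced definition as the harmonic mean of precision and recall; the text defers to [19] for the exact framework, so the equal weighting is an assumption of this example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Hypothetical boundary-detection scores for one segmentation.
    print(f_measure(precision=0.9, recall=0.8))
```

Because the harmonic mean is dominated by the smaller of its two arguments, a segmentation scores well only if it is both precise and complete.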

For comparison, we also apply the normalized cut approach within the same 
hierarchical framework with identical parameter settings. While the results indi- 
cate the superiority of the SDP relaxation approach, this one-to-one comparison 
should be judged with care: as the normalized cut relaxation cannot appropri- 
ately take into account the varying patch sizes, the over-segmentation produced 
with mean shift may not be an adequate starting point for this method. 

Fig. 3. Segmentation results for four color images (481 × 321 pixels) from the
Berkeley segmentation dataset [9]. Note the superior quality of the
segmentations obtained with the SDP relaxation approach in comparison to the
normalized cut relaxation, which is confirmed by the higher F-measures.

Fig. 4. Evolution of the hierarchical segmentation for the image from Fig. 1.
Note the coarse-to-fine nature of the evolution: first the broad parts of the
image (water and sky) are segmented, while the finer details of the surfer arise
later.

Finally, Fig. 4 gives an example of how the segmentation based on the SDP 
relaxation evolves hierarchically. In this context, note that although the water 
contains many patches (cf. Fig. 2), it is not split into more segments since the 
corresponding cut-values are too large. 

6 Conclusion 

We presented a hierarchical approach to unsupervised image segmentation which
is based on a semidefinite relaxation of a constrained binary graph cut problem.
To prevent large homogeneous regions from being split (a common problem of
balanced graph cut methods), we computed an over-segmentation of the image
in a preprocessing step using the mean shift technique. Besides yielding better
segmentations, this also reduced the problem size by several orders of magnitude.

The results illustrate an important advantage of the SDP relaxation in com- 
parison to other segmentation methods based on graph cuts: As the balancing 
constraint can be adjusted to the current problem, we can appropriately take 
into account the different size of image patches. Moreover, it is easy to include 
additional constraints to model other conditions on the image patches, like con- 
nections to enforce the membership of certain patches to the same segment. We 
will investigate this aspect of semi-supervised segmentation in our future work. 
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Abstract. This paper proposes an efficient pairwise surface matching 
approach for the automatic assembly of 3d fragments or industrial com- 
ponents. The method rapidly scans through the space of all possible solu- 
tions by a special kind of random sample consensus (RANSAC) scheme. 
By using surface normals and optionally simple features like surface cur- 
vatures, we can highly constrain the initial 6 degrees of freedom search 
space of all relative transformations between two fragments. The sug- 
gested approach is robust, very time and memory efficient, easy to im- 
plement and applicable to all kinds of surface data where surface normals 
are available (e.g. range images, polygonal object representations, point 
clouds with neighbor connectivity, etc.). 



1 Introduction 

The problem of surface matching is an important computer vision task with 
many applications. The field of application comprises object recognition, pose 
estimation, registration or fusion of partially overlapping range images or
volumes, protein docking, reconstruction of broken objects like archaeological
artifacts and reduction of bone fractures in computer-aided surgery. The common
challenge is to find a rigid geometric transformation which aligns the surfaces
in an optimal way. This paper focuses on the reconstruction of broken arbitrary
objects by matching their fragments. Nevertheless, the proposed method is easily
adaptable to all other applications mentioned above. The problem of matching
complementary fragments of broken objects is largely similar to the problem of 
matching partially overlapping surfaces (e.g. range images). The main common 
difficulties are: 

— The search space has six degrees of freedom (dof). 

— Digitized surfaces often are inaccurate and noisy. 

— Large data sets, i.e. a very high number of points in 3d. 

In the case of broken objects there are some additional difficulties: 

— A good initial guess, that can be used to iterate to the global minimum, 
generally is not available. 
— Very large surface areas without correspondence at the complementary part.

— Object intersections must be avoided. 

— Material deterioration. 



1.1 Related Work 

An outline of all publications dealing with registration techniques would go be- 
yond the scope of this paper. Therefore we will only give an overview of the 
most relevant work. A very popular surface registration approach is the iterative 
closest point (ICP) algorithm [1]. The approach iteratively improves an initial 
solution by finding closest point pairs and subsequently calculating the relative 
transformation which aligns the two point sets in terms of least square error [2].
Although many enhancements to the original method have been suggested, they
still require a good initial guess to find the global minimum. Many approaches
use local surface features to find corresponding point pairs. Features vary
from simple properties like surface normals or curvatures to complex vectors like
point signatures [3], surface curves (e.g. [4]), or spin-images [5]. However, their use
cannot guarantee unique point correspondences; nevertheless they can highly
constrain the search space.

A well-known category dealing with object recognition and localization are
the pose clustering approaches (also known as hypothesis accumulation or
generalized Hough transform, e.g. [6], [7], [8]). The basic idea is to accumulate low-level
pose hypotheses in a voting table, followed by a maxima search which identifies
the most frequent hypotheses. The drawback of voting tables is their relatively
high time and space complexity, particularly in the case of large data sets.

A simple and robust approach for fitting models, like lines and circles in
images, is the RANSAC algorithm introduced in [9]. The repeated procedure is
simple but powerful: First, a likely hypothesis is generated at random (uniform
distribution) from the input data set. Subsequently, the quality of the hypothesis
(e.g. number of inliers) is evaluated. The method has been applied to a wide range
of computer vision problems. The most closely related work [10] applies the RANSAC
scheme to the registration of partially overlapping range images with a resolution
of 64x64 pixels. The approach depends heavily on the fact that a range image
can be treated as a projection of 3d points onto an index plane; this is why it
is not applicable to the general case. Moreover, it does not take advantage of
surface normals or any other features.

Many papers address matching of two-dimensional fragments like jigsaw 
puzzles or thin-walled fragments e.g. [11]. The problem of matching comple- 
mentary three-dimensional object fragments (including the consideration of 
undesired fragment intersections) is rarely treated in the open literature. One of 
the recent approaches in this field of research is based on a pose error estimation 
using z-buffers of each fragment for each hypothesized pose [12]. The error 
minimization is performed by simulated annealing over a 7 dof search space of
all possible poses of two fragments in relation to a separating reference plane.
Although the approach makes use of the simulated annealing optimization, it
degenerates to an exhaustive search if the optimal matching direction is limited
to a small angular range, e.g. in case of matching a nut with a bolt. Another
disadvantage is that the computation of each z-buffer has a time complexity of O(n).



2 Method Overview 

Our suggested approach rapidly finds the ’interesting regions’ in the space of all 
possible relative rigid transformations of two fragments. Transformations where 
the parts do not get in contact are not considered at all. Moreover a desired 
surface contact is characterized by a preferably large touching area without a 
significant fragment penetration. Therefore the quality of a contact situation is 
determined by the size of the overlapping surface area and the absence of
surface penetration, which is treated with a penalty. According to the RANSAC
principle, the proposed method generates likely contact poses at random (see
Sect. 2.1) and estimates the matching quality of each pose by an efficient
combination of fast forecasting and the use of a binary search tree (see Sect. 2.2).
The basic procedure consists of the following steps: 

1. Select a random surface point $p_i \in A$ and a random surface point $p_j \in B$.
A surface contact at $p_i$ and $p_j$ constrains three degrees of freedom. Another
two degrees of freedom are fixed by the respective vertex normals $n_{p_i}$ and
$n_{p_j}$, which are directed against each other. Only the rotation around these
normals remains unknown.

Optional: Select only pairs which satisfy $\mathrm{feature}(p_i) \approx \mathrm{feature}(p_j)$.

2. Select a random second point pair ($q_i \in A$, $q_j \in B$) which forms a fully
constrained two-point contact pose together with the first pair $(p_i, p_j)$.
Optional: Select only pairs which satisfy $\mathrm{feature}(q_i) \approx \mathrm{feature}(q_j)$.

3. Estimate the matching quality of the pose. 

Efficiency can be increased considerably by using a fast forecasting technique
combined with a dropout if the expected result is considerably worse than the
last best match.

4. Repeat steps 1-3 and memorize the best match until the matching quality is 
good enough or a time limit is exceeded. 

Here $A \subset \mathbb{R}^3$ denotes the whole set of surface points of one fragment ($B$
respectively). Clearly, the consideration of local surface properties or features
can increase the likelihood that a pair corresponds to each other. In our
experiments we compare the mean curvature of a point pair, which enables us to reject
up to 90% as unsuitable. Curvatures can be obtained from all kinds of data sets
which provide neighbor connectivity (see e.g. [13] for curvature computation on 
triangle meshes). 
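
The four steps above can be sketched as a generic hypothesize-and-test loop. The following Python sketch is only an illustration under stated assumptions: `generate_pose` and `estimate_quality` are hypothetical callables standing in for the procedures of Sect. 2.1 and Sect. 2.2, and the parameter names `good_enough`, `time_limit_s` and `max_iters` are introduced here and do not appear in the original description.

```python
import time

def ransac_match(generate_pose, estimate_quality, good_enough, time_limit_s,
                 max_iters=100000):
    """Hypothesize-and-test loop following steps 1-4 (illustrative sketch)."""
    best_pose, best_quality = None, float("-inf")
    start = time.monotonic()
    for _ in range(max_iters):
        if time.monotonic() - start > time_limit_s:
            break                                   # step 4: time limit exceeded
        pose = generate_pose()                      # steps 1-2: random contact pose
        if pose is None:                            # no two-point contact was found
            continue
        # Step 3: the current best value is passed on so that the estimator
        # may drop out early for clearly inferior hypotheses.
        quality = estimate_quality(pose, best_quality)
        if quality > best_quality:
            best_pose, best_quality = pose, quality  # memorize the best match
        if best_quality >= good_enough:
            break                                   # step 4: quality is good enough
    return best_pose, best_quality
```

Passing the best quality seen so far into the estimator is one possible way to connect the loop with the dropout mentioned in step 3; other designs are equally conceivable.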

The strength of the algorithm is that it is independent of the fragment shapes. 
The efficiency is directly proportional to the size of compatible surface area 
(fracture interface) and independent of the tolerance of matching direction, etc.; 
thus the method is even applicable in case of components and assemblies with 
narrow mating directions (like nut and bolt or plug and socket) where approaches 
which iterate to a local minimum may fail. 
Fig. 1. Schematic illustration of the spin-table coordinates



2.1 Rapid Generation of Two-Point Contact Poses 

The first step of the basic procedure generates
a random pose where the fragments get in contact and the surfaces touch each
other tangentially (opposing surface normals $\pm n$ at contact point $p$). Since
this step only generates random surface point pairs, it is very fast and simple.
The second step tries to find a second point pair $(q_i, q_j)$ which can be brought
into contact by rotating around an axis with origin $p$ and direction $n$. A point
pair which fulfills this condition must have an equal distance $h$ to the tangent
plane (see Fig. 1) and an equal distance $r$ to the rotation axis. Referring to [5]
we will call the parameter pair $(r_i, h_i)$ the spin-table coordinates of a point $q_i$ in
relation to its rotation origin $p$. The following algorithm alternately adds point
references of $A$ and $B$ into their respective spin-tables $M_a$, $M_b$ until a point pair
with equal spin-table coordinates has been found, or until a counter exceeds the
limit $k$, which is a strong indication for a contact situation with marginal surface
overlap.

1. Clear spin-tables $M_a := \emptyset$, $M_b := \emptyset$.

2. Repeat $k$ times

3. Select a random point $q_i \in A$

4. Calculate its spin-table coordinates $(r_i, h_i)$ with respect to $p$

5. Insert $q_i$ into its respective spin-table $M_a(r_i, h_i) := q_i$

6. Read out the opposite spin-table $q_j := M_b(r_i, h_i)$

7. If ($q_j \neq \emptyset$) terminate loop; the new contact pair is $(q_i, q_j)$

8. Repeat steps 3-7 with reversed roles of $A$, $B$ and $M_a$, $M_b$

9. End-repeat

We achieve a good trade-off between accuracy and execution time with a table
size of 64x64. The basic procedure can be improved by accepting only contact
pairs $(q_i, q_j)$ with compatible features and roughly opposing surface normals
$n_{q_i}$, $n_{q_j}$. Furthermore, the efficiency can be increased by reusing the
filled spin-table $M_i$ of a point $p_i$ for multiple assumed contact points $p_j$ on the
counterpart.
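
The alternating table-filling procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the quantization function `cell`, the bound `extent` on $r$ and $|h|$, and the use of dictionaries as sparse 64x64 tables are assumptions made here. It is also assumed that the heights on fragment B are measured along the reversed normal of B, since the two contact normals oppose each other in the contact pose.

```python
import math
import random

def spin_coords(q, p, n):
    """Return (r, h): distance of q to the axis through p with unit direction n,
    and signed height of q above the tangent plane at p."""
    d = [q[k] - p[k] for k in range(3)]
    h = sum(d[k] * n[k] for k in range(3))
    r = math.sqrt(max(0.0, sum(d[k] * d[k] for k in range(3)) - h * h))
    return r, h

def cell(r, h, extent, size=64):
    """Quantize (r, h) to a table index; `extent` bounds r and |h| (assumption)."""
    i = min(size - 1, max(0, int(r / extent * size)))
    j = min(size - 1, max(0, int((h + extent) / (2.0 * extent) * size)))
    return i, j

def second_contact_pair(A, B, p_a, n_a, p_b, n_b, k, extent, rng=random):
    """Alternately insert random points of A and B into their spin-tables until
    one cell is occupied in both tables; returns index pair (i, j) or None."""
    tables = ({}, {})                                  # M_a and M_b, initially empty
    frames = ((A, p_a, n_a), (B, p_b, [-c for c in n_b]))
    for _ in range(k):
        for side in (0, 1):                            # first A, then reversed roles
            pts, p, n = frames[side]
            idx = rng.randrange(len(pts))
            key = cell(*spin_coords(pts[idx], p, n), extent)
            tables[side][key] = idx                    # insert into own table
            other = tables[1 - side].get(key)          # read out opposite table
            if other is not None:
                return (idx, other) if side == 0 else (other, idx)
    return None                                        # indicates marginal overlap
```

A returned pair only shares a table cell; the additional checks on features and surface normals mentioned above would be applied to it afterwards.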

2.2 Fast Quality Estimation of a Given Pose 

After generating a hypothesis we must measure its matching quality. For this
we estimate the proportion of overlapping area $\Omega$ where surface $A$ is in contact
with the opposite surface $B$ in the given pose. We assume that the surfaces
are in contact at areas where the distances between surface points are smaller
than $\varepsilon$. $\Omega$ can also be regarded as the probability that a random point $x \in A$
is in contact with the opposite surface $B$. Thus $\Omega$ can be forecasted by an
efficient Monte-Carlo strategy using a sequence of random points, combined
with a dropout if the expected result is considerably worse than the last best
match. Suppose that $x_1, \ldots, x_n$ are independent random points with $x_i \in A$.
Then $\Omega$ is given by

$$\Omega = \lim_{n \to \infty} \frac{\sum_{i=1}^{n} \mathrm{contact}_B(x_i)}{n} \qquad (1)$$



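where $\mathrm{contact}_B(x)$ is a function which determines whether point $x$ is in contact
with surface $B$

$$\mathrm{contact}_B(x) = \begin{cases} 1 & \text{if } \mathrm{dist}_B(x) < \varepsilon \\ 0 & \text{else} \end{cases} \qquad (2)$$

and $\mathrm{dist}_B(x)$ is a function which returns the minimal distance of a point $x$ to
surface $B$

$$\mathrm{dist}_B(x) = \min_{y \in B} |x - y| \qquad (3)$$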

Our implementation of this function is based on a kd-tree data structure (see
[15]) and therefore offers a logarithmic time complexity for the closest point
search.
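
To illustrate the closest point search, the following is a minimal kd-tree sketch in pure Python. It is not the implementation referred to above; the dictionary-based node layout and the function names `build_kdtree`, `nearest` and `dist_b` are assumptions chosen for brevity.

```python
import math

def build_kdtree(points, depth=0):
    """Build a kd-tree over 3d points by median split along cycling axes."""
    if not points:
        return None
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }

def nearest(node, x, best=None):
    """Return (distance, point) of the stored point closest to x."""
    if node is None:
        return best
    d = math.dist(x, node["point"])
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = x[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, x, best)
    # The far half-space can only contain a closer point if the splitting
    # plane is nearer than the best distance found so far.
    if abs(diff) < best[0]:
        best = nearest(far, x, best)
    return best

def dist_b(tree_b, x):
    """Minimal distance of x to the point set stored in tree_b, cf. Eq. (3)."""
    return nearest(tree_b, x)[0]
```

Since `nearest` also returns the closest point itself, the same query can supply the point needed for an inside/outside test without a second search.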

In contrast to Eq. (1) it is only possible to try a limited number $n$ of random
points; thus $\Omega$ can only be approximated up to an arbitrary level of confidence.
(Notice that this restriction only holds if $A$ and $B$ contain an infinite number of
points.) Considering the margin of error, for every additional random point, the
approximation of $\Omega$ can be recomputed as

$$\Omega \approx \frac{\sum_{i=1}^{n} \mathrm{contact}_B(x_i)}{n} \pm \frac{1.96}{2\sqrt{n}} \qquad (4)$$



with a 95% level of confidence. If the upper bound of this range is worse than 
the last best match, we abort the computation and try the next hypothesis. 
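
The forecasting with dropout can be sketched as follows. The margin $1.96 / (2\sqrt{n})$ is the conservative 95% bound for a proportion, since the variance $p(1-p)$ of a single contact test never exceeds $1/4$. The function name `forecast_overlap`, its return triple and the callable `contact` are assumptions introduced for this sketch.

```python
import math
import random

def forecast_overlap(points_a, contact, best_so_far, n_max, rng=random):
    """Estimate the overlap proportion from random points of A.

    Returns (estimate, samples_used, dropped_out). The evaluation is aborted
    as soon as the upper end of the 95% confidence range falls below
    best_so_far, the quality of the best match found so far.
    """
    hits = 0
    for n in range(1, n_max + 1):
        x = points_a[rng.randrange(len(points_a))]
        hits += contact(x)                       # 1 if x touches B, else 0
        margin = 1.96 / (2.0 * math.sqrt(n))     # conservative 95% margin
        if hits / n + margin < best_so_far:
            return hits / n, n, True             # dropout: cannot beat best match
    return hits / n_max, n_max, False
```

With a current best match of 0.5 and a hypothesis without any contact, for example, this sketch gives up after four samples, because $1.96 / (2\sqrt{4}) = 0.49 < 0.5$.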

Up to this point we have not considered surface penetrations at all. To ensure
that the fragments do not penetrate each other we simply subtract a penalty for
penetrating points (i.e. points which are more than $\varepsilon$ 'below' the complementary
fragment surface). This can be done by using an alternative contact rating

$$\mathrm{contact}_B(x) = \begin{cases} 1 & \text{if } \mathrm{dist}_B(x) < \varepsilon \quad \text{(at surface)} \\ -4 & \text{if } \mathrm{dist}_B(x) \ge \varepsilon \,\wedge\, (x - y) \cdot n_y < 0 \quad \text{(inside)} \\ 0 & \text{else} \quad \text{(outside)} \end{cases} \qquad (5)$$

where $y \in B$ denotes the closest point to $x$, and $n_y$ denotes the surface normal
of $y$. Since $y$ can be determined simultaneously with the calculation
of the minimal distance in Eq. (3), it is straightforward to test whether $x$ lies
inside or outside of fragment $B$. The execution time can be reduced by an
approximative early bound of the kd-tree search.
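
The penalized contact rating of Eq. (5) can be sketched as below. For brevity the closest point is found by brute force, whereas the text uses a kd-tree; the representation of surface B as a list of (point, unit normal) pairs and the parameter name `penalty` are assumptions of this sketch.

```python
import math

def contact_rating(x, surface_b, eps, penalty=4.0):
    """Rate point x against surface B: 1 at the surface, -penalty inside, 0 outside.

    surface_b is a list of (point, unit_normal) pairs with outward normals.
    """
    # Closest point y on B and its normal n_y (brute force for illustration).
    y, n_y = min(surface_b, key=lambda s: math.dist(x, s[0]))
    if math.dist(x, y) < eps:
        return 1.0                               # in contact
    # The sign of (x - y) . n_y tells on which side of the surface x lies.
    side = sum((x[k] - y[k]) * n_y[k] for k in range(3))
    return -penalty if side < 0 else 0.0
```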
Fig. 2. Some matching results: (a) plug and socket; (b) broken boulder; (c) broken
venus; (d) broken Stanford bunny; (e) broken bone model; (f) broken pelvis 



2.3 Matching Constraints for Special Types of Fractures 

Up to this point, we did not use any knowledge about the object. However, in
many applications, additional information about the parts to be combined is
available and can be used to constrain the search space. This includes information
about the object shape (e.g. [16]), geometric features like a sharp curvature
transition from an intact surface to a broken one (e.g. [4]), or knowledge of the
roughness of the fracture. An integration of these constraints into our approach
is straightforward and can drastically limit the search space and thus reduce the
execution time. Particularly suitable for this matching procedure is prior
knowledge of surface points at the fracture interface which must be brought into
contact with the counterpart. These 'contact constraints' can greatly increase the
efficiency of the contact pose generation of Sect. 2.1. A detailed discussion of
these additional constraints is the subject of further publications.

3 Experimental Results and Conclusion 

The method has been evaluated on high-resolution triangle meshes of artificial 
and digitized objects with computer generated and real fractures. All computer 
generated fractures are distorted with some additive Gaussian noise and it is 
assured that the tessellation of one fragment is unrelated to its counterpart.
Table 1. Evaluation of 100 passes per experiment. The entries are: total number of
vertices; area of fracture interface (Overlap); added noise level in percent referring to
the maximal object radius; lower limit (LRL), upper limit (URL), and mean μ of the
execution time in seconds; and standard deviation σ of the execution time

             Vertices  Overlap  Add. Noise    LRL      URL       μ       σ
plug (a)       41772     29%       0.0%      0.09    10.28     2.52    2.85
                                   1.0%      0.21   154.31    46.51   43.92
                                   1.5%      0.33   186.27    48.66   45.27
                                   2.0%      2.15   263.96    64.94   63.86
boulder (b)    21068     38%       0.1%      0.11    22.58     4.85    4.53
venus (c)      51837     28%       0.1%      0.31    27.42     6.39    5.79
bunny (d)      96170     30%       0.1%      0.41    64.31    11.61   10.68
bone (e)       35503     13%       0.0%      0.08   114.62    28.01   28.83
pelvis (f)     57755      1%       0.0%       -        -        -       -



Fig. 3. (a) Example with 2% noise distortion and smoothing result; (b) broken pelvis 
and the ’constraint matching’ result 



Fig. 2 shows some representative test examples with different noise levels. To 
make the calculation of surface normals and curvatures robust against noise we 
apply a common mesh smoothing filter [13]. The left side of Fig. 3 visualizes 
the intensity of our maximal noise distortion, and the effect of smoothing. Some 
of our data sets are real human bone fractures (femurs and pelvis), which are 
extracted from computer tomographic scans. The medical relevance of a robust 
bone registration method is outlined in [16]. The required execution times for 
successful assembly of these fragments are listed in Table 1. The tests were 
performed on an AMD Athlon XP/1466MHz based PC. As can be seen, the 
method performs well with a variety of objects (3d fragments as well as industrial 
components). The desired matching accuracy is implicitly regulated by the input
parameter $\varepsilon$ (maximal contact distance of Eq. (2)) and the minimal matching
quality $\Omega$. In the unconstrained case the algorithm always finds the solution with
the largest touching area, which is not always the desired one. This occurs if the
portion of fracture interface is small in comparison to the total surface area, and
if the intact object surface includes large smooth areas with similar regions on
the counterpart (e.g. the pelvis in Fig. 2). In these cases additional constraints
(which are discussed in Sect. 2.3) are indispensable to prevent the algorithm from
finding "trivial solutions". The desired solution of the pelvis (see the right side
of Fig. 3) can be achieved by constraining the matching direction, or by selecting
some vertices at the fractured surface which must be brought into contact with
the counterpart. In further publications we will discuss the usage of these additional
constraints.
Abstract. Due to the increasing interest in 3D models in various applications
there is a growing need to support e.g. the automatic search or
the classification in such databases. As the description of 3D objects is 
not canonical it is attractive to use invariants for their representation. We 
recently published a methodology to calculate invariants for continuous 
3D objects defined in the real domain $\mathbb{R}^3$ by integrating over the group
of Euclidean motion with monomials of a local neighborhood of voxels 
as kernel functions and we applied it successfully for the classification 
of scanned pollen in 3D. In this paper we are going to extend this idea 
to derive invariants from discrete structures, like polygons or 3D-meshes 
by summing over monomials of discrete features of local support. This 
novel result for a space-invariant description of discrete structures can be 
derived by extending Haar integrals over the Euclidean transformation 
group to Dirac delta functions. 



1 Introduction 

Invariant features are an elegant way to solve the problem of e.g. space invariant 
recognition. The idea is to find a mapping T which is able to extract intrinsic 
features of an object, i.e., features that stay unchanged if the object’s position 
and/or orientation changes. Such a transformation T necessarily maps all images 
of an equivalence class of objects under the transformations group G into one 
point of the feature space: 

$$x_1 \overset{G}{\sim} x_2 \;\Rightarrow\; T(x_1) = T(x_2). \qquad (1)$$

A mapping T which is invariant with respect to G is said to be complete if T is 
a bijective mapping between the invariants and the equivalence classes, i.e. if we 
additionally ask 

$$T(x_1) = T(x_2) \;\Rightarrow\; x_1 \overset{G}{\sim} x_2. \qquad (2)$$

C.E. Rasmussen et al. (Eds.): DAGM 2004, LNCS 3175, pp. 137—144, 2004. 

(c) Springer- Verlag Berlin Heidelberg 2004 
For a given gray scale image $X$ and a kernel function $f(X)$ it is possible to
construct an invariant feature $T[f](X)$ by integrating $f(gX)$ over the transformation
group $G$:

$$T[f](X) := \int_G f(gX)\, dg. \qquad (3)$$

As kernel functions $f$ we typically use monomials from a local pixel- or
voxel-neighborhood (FLSs, Functions of Local Support). These Haar integrals are
defined for continuous objects in $\mathbb{R}^2$ or $\mathbb{R}^3$ ([4, 5, 6, 2, 3]). This integral can in practice
only be evaluated for compact groups, which means that the parameters
describing the group lie in a finite region. In the sequel we will call these invariants HIs
(Haar Invariants).





Fig. 1. Discrete structures in 2D and 3D: (a) closed contour described by a polygon (b) 
wireframe object (c) 3D triangulated surface mesh (d) molecule. 



For discrete objects A (see Fig. 1) which vanish almost everywhere this inte- 
gral delivers, however, trivial or zero results (depending on the kernel function). 
Discrete structures can be described by Dirac delta functions (examples of dis- 
tributions or generalized functions), which are different from zero on a set of 
measure zero of the domain. However, properly chosen integrals of these delta 
functions exist and deliver finite values. We will use this concept to define proper 
Haar invariants from discrete structures (DHIs Discrete Haar Invariants). 

There exists a vast literature on invariant-based shape representations and
object recognition problems; space limitations do not allow a thorough
review here. However, to the best of the authors' knowledge there exist no
approaches similar to the DHIs.

2 Invariants for Discrete Objects 

For a discrete object $A$ and a kernel function $f(A)$ it is possible to construct an
invariant feature $T[f](A)$ by integrating $f(gA)$ over the transformation group
$g \in G$. Let us assume that our discrete object is different from zero only at
its vertices. A rotation and translation invariant local discrete kernel function
$h$ takes care of the algebraic relations to the neighboring vertices and we can
write

$$f(A) = \sum_{i \in V} h(A, x_i)\, \delta(x - x_i), \qquad (4)$$

where $V$ is the set of vertices and $x_i$ the vector representing vertex $i$.

In order to get finite values from the distributions it is necessary to
introduce under the Haar integral another integration over the spatial domain $X$. By
choosing an arbitrary integration path in the continuous group $G$ we can visit
each vertex in an arbitrary order; the integral is transformed into a sum over all
local discrete functions allowing all possible permutations of the contributions of
the vertices. As the discrete neighborhood functions are attached to the vertices
they are already invariant to $G$, i.e. $h(gA, g x_i) = h(A, x_i)$, and hence we get

$$T[f](A) := \int_G \int_X f(gA)\, dx\, dg = \int_G \left[ \int_X \sum_{i \in V} h(gA, g x_i)\, \delta(gx - g x_i)\, dx \right] dg. \qquad (5)$$



Therefore we get invariants by simply adding local Euclidean-invariant (rotation
and translation) DFLS $h(A, x_i)$ (DFLSs, Discrete Functions of Local Support)
over all vertices. The interpretation of the result is straightforward: summing
local Euclidean invariants also yields global Euclidean invariants.



2.1 Invariants for Polygons 

Let us apply the general principles for DHIs to polygons. As discrete functions of
local support (DFLS) we choose monomials of the distances between neighboring
vertices (which are obviously invariant to rotation) up to degree $k = 4$ in the
following example:

$$\tilde{x}_{n_1,n_2,n_3,n_4} = \sum_{i \in V} h(A, x_i) = \sum_{i \in V} d_{i,1}^{n_1}\, d_{i,2}^{n_2}\, d_{i,3}^{n_3}\, d_{i,4}^{n_4} \qquad (6)$$

and the $d_{i,k}$ denote the Euclidean distance between vertex $i$ and its $k$-th right-hand
neighbor:

$$d_{i,k} = \| x_i - x_{\langle i+k \rangle} \|. \qquad (7)$$

The brackets $\langle \cdot \rangle$ denote a reduction modulo the number of vertices of a contour.

By varying the exponents we can build up a feature space. The nonlinear
nature of these monomial-based kernel functions gives them a rich discriminative
ability. In general, the discrimination performance of this feature space will
increase with the number of features used.
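
The construction of one such invariant can be sketched in Python. This is an illustration only: the function name `polygon_invariant` and the representation of a closed polygon as an ordered list of 2D vertex tuples are assumptions; the computation follows Eqs. (6) and (7), i.e. a sum over all vertices of a monomial in the distances to the next neighbors along the contour.

```python
import math

def polygon_invariant(vertices, exponents):
    """Sum over all vertices i of prod_k d_{i,k}**n_k.

    d_{i,k} is the Euclidean distance between vertex i and its k-th neighbor
    along the contour (indices taken modulo the number of vertices);
    exponents = (n_1, n_2, ...) selects the monomial.
    """
    m = len(vertices)
    total = 0.0
    for i in range(m):
        term = 1.0
        for k, n_k in enumerate(exponents, start=1):
            d = math.dist(vertices[i], vertices[(i + k) % m])
            term *= d ** n_k
        total += term
    return total
```

For the exponent vector (1, 0, 0, 0) the invariant is simply the perimeter of the polygon. A feature vector is obtained by evaluating the function for several exponent vectors.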



Principle of rigidity. It can easily be shown that the features $\{d_{i,1}, d_{i,2}, d_{i,3}\}$ (see
Fig. 2) uniquely define a complete polygon (up to a mirror-polygon) because we
can iteratively construct the polygon by rigidly piecing together rigid triangles

Fig. 2. (a) Example of a polygon (b) basic features $\{d_{i,1}, d_{i,2}, d_{i,3}\}$



(hereby we do not allow consecutive collinear edges in a polygon). Therefore we
call these elements a basis. We expect to get a more and more complete feature
space by integrating over more and more monomials from these basic features.

Looking at a triangle as the simplest polygon, one can show that the
following three features derived from the three sides $\{a, b, c\}$ form a complete set
of invariants:

$$\tilde{x}_0 = a + b + c, \quad \tilde{x}_1 = a^2 + b^2 + c^2, \quad \tilde{x}_2 = a^3 + b^3 + c^3. \qquad (8)$$

These features are equivalent to the elementary symmetric polynomials in 3
variables, which are a complete set of invariants with respect to all permutations.
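
The stated equivalence can be made explicit with Newton's identities. In the following worked step, $\tilde{x}_0, \tilde{x}_1, \tilde{x}_2$ denote the three power sums of Eq. (8), and the symbols $e_1, e_2, e_3$ for the elementary symmetric polynomials are notation introduced here for illustration only.

```latex
\begin{align*}
e_1 &= a + b + c = \tilde{x}_0, \\
e_2 &= ab + bc + ca = \tfrac{1}{2}\left(\tilde{x}_0^2 - \tilde{x}_1\right), \\
e_3 &= abc = \tfrac{1}{6}\left(\tilde{x}_0^3 - 3\,\tilde{x}_0\,\tilde{x}_1 + 2\,\tilde{x}_2\right).
\end{align*}
% The side lengths are the roots of t^3 - e_1 t^2 + e_2 t - e_3 = 0,
% so they are determined by the three features up to their order.
```

Since the three side lengths are recovered up to a permutation, and a triangle is fixed by its side lengths up to a Euclidean motion and mirroring, this is consistent with the completeness claim above.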

It is not our intention to compare the results here with Fourier Descrip- 
tors (see e.g. [1]). First the approaches are rather different (integration over the 
transformation group versus normalization technique) and second the proposed 
method is much easier to extend from 2D to 3D. 

2.2 3D-Meshes 

Although we will not elaborate a rigorous concept for 3D discrete objects we
want to outline that it is straightforward to extend the concept e.g. to a 3D
surface mesh or a real 3D wireframe model. Again we derive DFLS of a certain
neighborhood and sum over all vertices. It is appropriate to use here also basic
features which meet the requirement that they constitute rigid local polyhedra
which can be rigidly pieced together to a solid surface or polyhedral object (see
Fig. 3). Similar to the triangle as a basic building block for a planar polygon
we can use a tetrahedron as a basic building block for a polyhedron. And as we
can find three invariants for a triangle we similarly can find invariants for a
tetrahedron derived from its edge lengths.

2.3 Discrimination Performance and the Question of Completeness 

A crucial question is how many features we need to get a good discrimination 
performance and avoid ambiguities up to the point to use a complete set of 


Fig. 3. Part of a surface mesh with neighborhood of degree one and two. 



features. Asking for completeness is a very strong demand because it guarantees 
that no ambiguities between two different objects exist. For a practical pattern 
recognition problem we have to solve the much easier problem of separability to 
discriminate between a finite number of representatives of object classes (like 
the 26 classes in character recognition). Therefore we are typically content with 
a finite number of invariants which is typically much less than the number of 
features needed for completeness. 




Fig. 4. Topologically equivalent (TE) and non topologically equivalent discrete struc- 
tures (NTE). 



As we use discrete features the methodology is sensitive to topological 
changes in the structure (see Fig. 4). So if we introduce another vertex on a 
polygon in the middle of an edge for example the invariants will clearly change. 
Discrete similarities are not necessarily visual similarities. Introducing an atom 
into a molecule also has a high impact on the function. 

3 Experiments: Object Classification in a Tangram 
Database 

We will demonstrate the construction of DHIs for the simplest case of closed
polygons. As an example we have chosen a subset of 74 samples from the
Tangram-Man-Database¹. Fig. 5 shows a selection of objects. As characteristic feature we

¹ See "Tangram Man" at http://www.reijnhoudt.nl/tangram

Fig. 5. The 74 tangrams used in the experiment.



extract the outer contour of the objects. This example is interesting because the 
objects can not easily be discriminated with trivial geometric features, as they 
all have the same area and their contours build clusters of rather few possible 
numbers. 

As noise we added to each vertex a vector of a certain length with a
uniformly distributed angle from $[0, 2\pi]$. The amount of noise was measured against
the standard deviation $\sigma$ of all edges of the polygon, which is computed from the
squared deviations $\| x_i - \bar{x} \|^2$. Fig. 6
shows an example with 10% noise added, which already leads to remarkable
distortions.

As the calculation of the monomials is a continuous function of the
coordinates of the vertices, the additive noise propagates continuously
through the invariants. Therefore, small distortions will also cause only small
changes in the magnitude of the invariants.
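
The noise model described above can be sketched as follows. The function name `add_vertex_noise` is hypothetical, and the way a noise level in percent would be turned into the displacement `length` (for example as a fraction of the reference standard deviation) is an assumption that the caller has to supply.

```python
import math
import random

def add_vertex_noise(vertices, length, rng=random):
    """Displace every 2D vertex by a vector of fixed length whose direction
    angle is drawn uniformly from [0, 2*pi]."""
    noisy = []
    for x, y in vertices:
        phi = rng.uniform(0.0, 2.0 * math.pi)
        noisy.append((x + length * math.cos(phi), y + length * math.sin(phi)))
    return noisy
```

Passing a seeded `random.Random` instance as `rng` makes a noise experiment repeatable.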




Fig. 6. Exact Contour and with 10% noise added. 
Table 1. Exponent table for monomials built from the distances $d_{i,1}, \ldots, d_{i,4}$ for calculating 14
invariants

        $\tilde{x}_0$   $\tilde{x}_8$
        1  1  0  1  0  0  2  0
$n_4$   0  0  0  0  0  0  1  1  1  1  0  0  0  2



Table 2. Classification result for 5%, 10% and 20% noise for 74 tangrams with a
Euclidean (E) and a Mahalanobis (M) Classifier

noise (in percent)           5    5    5    10   10   10   20   20   20
# of invariants              6    10   14   6    10   14   6    10   14
metric                       E    E    E    M    M    M    M    M    M
class. error (in percent)    30   10   6    1.5  0    0    25   7    3



The experiments were conducted with three sets of 6, 10 and 14 invariants
respectively according to Eq. (6) (see Table 1); the subsets of 6 and 10 are just the
first 6 and 10 of the 14 invariants. The classification performance was measured
against additive noise of 5%, 10% and 20%.

Table 2 shows the result of our experiments averaged over 50 samples.
Choosing for the sake of simplicity a Euclidean classifier measuring the distance to
the expectation of the class centers leads to rather bad classification results.
Adding only 5% noise results in a misclassification of 30% of the tangrams using
6 invariants. This number can be reduced to 10% false classifications by adding
4 further invariants and to 6% using 14 invariants. This result is not surprising.
Looking at the invariants we can observe a large variance of their magnitudes
due to the differences in the degree of the monomials. We can, however,
drastically improve the result by using a Mahalanobis classifier. This is due to the
fact that the noise covariance matrices of all objects are very similar. Therefore
we made experiments with a Mahalanobis classifier based on an averaged
covariance matrix over all object classes. Now even an increase of noise by a factor
of two to 10% leads only to 1.5% errors with 6 invariants and 0% errors for 10 and
14 invariants, which demonstrates the very good performance of the invariants.
Even with 20% noise and 14 invariants we get a rather low error rate of only
3%.
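
The difference between the two classifiers can be sketched as follows. This is a minimal illustration, not the experimental code: all function names are hypothetical, the shared covariance matrix is assumed to be given (for instance as the average over all class covariances), and the linear system is solved by plain Gaussian elimination to keep the sketch free of dependencies.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][c] * z[c] for c in range(i + 1, n))
        z[i] = s / m[i][i]
    return z

def euclidean_sq(x, mean):
    """Squared Euclidean distance to a class center."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mean))

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance d^T cov^{-1} d with a shared covariance."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    return sum(di * zi for di, zi in zip(d, solve(cov, d)))

def classify(x, class_means, dist):
    """Index of the class center with the smallest distance to x."""
    return min(range(len(class_means)), key=lambda c: dist(x, class_means[c]))
```

The Mahalanobis distance rescales each invariant by its noise variance, which is why invariants of very different magnitude no longer dominate the decision.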

If we constrain our calculation to a finite number of invariants we end up 
with a simple linear complexity in the number of vertices. This holds if the 
local neighborhood of vertices is resolved already by the given data structure; 
otherwise the cost for resolving local neighborhoods must be added. In contrast 
to graph matching algorithms we apply here algebraic techniques to solve the 
problem. This has the advantage that we can apply hierarchical searches for 
retrieval tasks, namely, to start only with one feature and hopefully eliminate
already a large number of objects and then continue with an increasing number 
of features etc. 
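
Such a hierarchical search could look like the following sketch. The function name, the database layout (a dictionary mapping object names to feature tuples) and the relative tolerance test are assumptions made purely for illustration.

```python
def hierarchical_search(query, database, tolerance):
    """Filter candidates feature by feature, starting with the first invariant.

    database maps an object name to its tuple of invariants; a candidate
    survives a stage if its feature lies within a relative tolerance of the
    corresponding query feature.
    """
    candidates = list(database)
    for k in range(len(query)):
        limit = tolerance * max(1.0, abs(query[k]))
        candidates = [name for name in candidates
                      if abs(database[name][k] - query[k]) <= limit]
        if len(candidates) <= 1:
            break                      # nothing left to discriminate
    return candidates
```

Each stage only touches the objects that survived the previous one, so cheap low-order invariants can discard most of the database before further features are compared.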

4 Conclusions 

In this paper we have introduced a novel set of invariants for discrete structures 
in 2D and 3D. The construction is a rigorous extension of Haar integrals over 
transformation groups to Dirac Delta Functions. The resulting invariants can 
easily be calculated with linear complexity in the number of vertices. The pro- 
posed approach has the potential to be extended to other discrete structures and 
even to the more general case of weighted graphs. 
Abstract. The goal of our work is object categorization in real-world scenes.
That is, given a novel image we want to recognize and localize previously unseen
objects based on their similarity to a learned object category. For use in a real- 
world system, it is important that this includes the ability to recognize objects at 
multiple scales. 

In this paper, we present an approach to multi-scale object categorization using 
scale-invariant interest points and a scale-adaptive Mean-Shift search. The ap- 
proach builds on the method from [12], which has been demonstrated to achieve 
excellent results for the single-scale case, and extends it to multiple scales. We 
present an experimental comparison of the influence of different interest point 
operators and quantitatively show the method's robustness to large scale changes. 



1 Introduction 

Many current object detection methods deal with the scale problem by performing an 
exhaustive search over all possible object positions and scales [17,18,19]. This exhaustive 
search imposes severe constraints, both on the detector’s computational complexity and 
on its discriminance, since a large number of potential false positives need to be excluded. 
An opposite approach is to let the search be guided by image structures that give cues 
about the object scale. In such a system, an initial interest point detector tries to find 
structures whose extent can be reliably estimated under scale changes. These structures
are then combined to derive a comparatively small number of hypotheses for object 
locations and scales. Only those hypotheses that pass an initial plausibility test need to 
be examined in detail. In recent years, a range of scale-invariant interest point detectors 
have become available which can be used for this purpose [13,14,15,10]. 

In this paper, we apply this idea to extend the method from [12,11]. This method has 
recently been demonstrated to yield excellent object detection results and high robustness 
to occlusions [11]. However, it has so far only been defined for categorizing objects at 
a known scale. In practical applications, this is almost never the case. Even in scenarios 
where the camera location is relatively fixed, objects of interest may still exhibit scale 
changes of at least a factor of two simply because they occur at different distances to 
the camera. Scale invariance is thus one of the most important properties for any system 
that shall be applied to real-world scenarios without human intervention. 

This paper contains four main contributions: (1) We extend our approach from [12,11] 
to multi-scale object categorization, thus making it usable in practice. Our extension 

Table 1. Comparison of results on the UIUC car database reported in the literature. 

Method               Equal Error Rate   Scale Inv. 
Agarwal et al. [1]   ~79%               no 
Garg et al. [9]      ~88%               no 
Leibe et al. [11]    97.5%              no 
Fergus et al. [8]    88.5%              yes 
Our approach         91.0%              yes 

is based on the use of scale-invariant interest point detectors, as motivated above. (2) 
We formulate the multi-scale object detection problem in a Mean-Shift framework, 
which allows us to draw parallels to Parzen window probability density estimation. We 
show that the introduction of a scale dimension in this scheme requires the Mean-Shift 
approach to be extended by a scale adaptation mechanism that is different from the 
variable-bandwidth methods proposed so far [6,4]. (3) We experimentally evaluate the 
suitability of different scale-invariant interest point detectors and analyze their influence 
on the recognition results. Interest point detectors have so far mainly been evaluated in 
terms of repeatability and the ability to find exact correspondences [15,16]. As our task 
requires the generalization to unseen objects, we are more interested in finding similar 
and typical structures, which imposes different constraints on the detectors. (4) Last 
but not least, we experimentally evaluate the robustness of the proposed approach to 
large scale changes. While other approaches have used multi-scale interest points also 
for object class recognition [7,8], no quantitative analysis of their robustness to scale 
changes has been reported. Our results show that the proposed approach outperforms 
state-of-the-art methods while being robust to scale changes of more than a factor of 
two. In addition, our quantitative results allow us to draw some interesting conclusions for 
the design of suitable interest point detectors. 

The paper is structured as follows. The next section discusses related work. After 
that, we briefly review the original single-scale approach. Section 3 then describes our 
extension to multiple scales. In Section 4, we examine the influence of different interest 
point detectors on the recognition result. Finally, Section 5 evaluates the robustness to 
scale changes. 



2 Related Work 

Many current methods for detection and recognition of object classes learn global or 
local features in fixed configurations or using configuration classifiers [17,18,19]. They 
recognize objects of different sizes by performing an exhaustive search over scales. 
Other approaches represent objects by more flexible models involving hand-defined 
or learned object parts. [20] models the joint spatial probability distribution of such 
parts, but does not explicitly deal with scale changes. [8] extends this approach to learn 
scale-invariant object parts and estimates their joint spatial and appearance distribution. 
However, the complexity of this combined estimation step restricts the method to a 
small number of parts. [7] also describes a method for selecting scale-invariant object 
parts, but this method is currently defined only for part detection, not yet on an object 
level. Most directly related to our approach, [1] learns a vocabulary of object parts for 
recognition and applies a SNoW classifier on top of them (which is later combined with 
the output of a more global classifier in [9]). [3] learns a similar vocabulary for generating 
class-specific segmentations. Both approaches only consider objects at a single scale. 
Our approach combines both ideas and integrates the two processes of recognition and 
figure-ground segmentation into a common probabilistic framework [12,11], which will 
also be the basis for our scale-invariant system. The following section briefly reviews 
this approach. As space does not permit a complete description, we only highlight 
the most important points and refer to [12,11] for details. 



2.1 Basic Approach 

The variability of a given object category is represented by learning, in a first step, a class- 
specific codebook of local appearances. For this, fixed-size image patches are extracted 
around Harris interest points from a set of training images and are clustered with an 
agglomerative clustering scheme. We then learn the spatial distribution of codebook 
entries for the given category by storing all locations the codebook entries were matched 
to on the training objects. During recognition, this information is used in a probabilistic 
extension of the Generalized Hough Transform [2,14]. Each patch $e$ observed at location 
$\ell$ casts probabilistic votes for different object identities $o_n$ and positions $x$ according to 
the following equation: 

$$p(o_n, x \mid e, \ell) = \sum_i p(o_n, x \mid I_i, \ell)\, p(I_i \mid e), \qquad (1)$$

where $p(I_i \mid e)$ denotes the probability that patch $e$ matches codebook entry $I_i$, and 
$p(o_n, x \mid I_i, \ell)$ describes the stored spatial probability distribution for the object center 
relative to an occurrence of that codebook entry. In [12], object hypotheses are found as 
maxima in the voting space using a fixed-size search window $W$: 

$$\mathrm{score}(o_n, x) = \sum_k \sum_{x_j \in W(x)} p(o_n, x_j \mid e_k, \ell_k). \qquad (2)$$



For each such hypothesis, we then obtain the per-pixel probabilities of each pixel being 
figure or ground by the following double marginalization, thus effectively segmenting 
the object from the background (again see [12,1 1] for details): 



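$$p(\mathbf{p} = \text{figure} \mid o_n, x) = \sum_{\mathbf{p} \in (e,\ell)} \sum_I p(\mathbf{p} = \text{fig.} \mid o_n, x, I, \ell)\, \frac{p(o_n, x \mid I, \ell)\, p(I \mid e)\, p(e, \ell)}{p(o_n, x)} \qquad (3)$$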



The per-pixel probabilities are then used in an MDL-based hypothesis verification stage 
in order to integrate only information about the object and discard misleading influences 
from the background [11]. The resulting approach achieves impressive results (as a 
comparison with other methods in Tab. 1 shows), but it has the inherent limitation that 
it can only recognize objects at a known scale. In practical applications, however, the 
exact scale of objects is typically not known beforehand, and there may even be several 
objects with different scales in the same scene. In order to make the approach applicable 
in practice, it is thus necessary to achieve scale-invariant recognition. 



3 Extension to Multiple Scales 

A major point of this paper is to extend recognition to multiple scales using scale- 
invariant interest points. The basic idea behind this is to replace the single-scale Harris 
codebook used up to now by a codebook derived from a scale-invariant detector. Given 
an input image, the system applies the detector and obtains a vector of point locations, 
together with their associated scales. Patches are extracted around the detected locations 
with a radius relative to the scale $\sigma$ of the interest point (here: $r = 3\sigma$). In order to match 
image structures at different scales, the patches are then rescaled to the codebook size 
(in our case 25 × 25 pixels). 

The probabilistic framework can be readily extended to multiple scales by treating 
scale as a third dimension in the voting space. If an image patch found at location 
$(x_{img}, y_{img}, s_{img})$ matches a codebook entry that has been observed at position 
$(x_{occ}, y_{occ}, s_{occ})$ on a training image, it votes for the following coordinates: 

$$x_{vote} = x_{img} - x_{occ}\,(s_{img}/s_{occ}) \qquad (4)$$

$$y_{vote} = y_{img} - y_{occ}\,(s_{img}/s_{occ}) \qquad (5)$$

$$s_{vote} = s_{img}/s_{occ} \qquad (6)$$
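
As a small worked illustration of equations (4)-(6), the following sketch computes a single vote. It assumes that the stored occurrence position is expressed as an offset relative to the object center on the training image; the function name `cast_vote` and the sample numbers are hypothetical and not taken from the paper.

```python
def cast_vote(img_point, occurrence):
    """Return (x_vote, y_vote, s_vote) following equations (4)-(6).

    img_point:  (x_img, y_img, s_img) of the patch found in the test image.
    occurrence: (x_occ, y_occ, s_occ) stored for the matched codebook entry
                (assumed here to be an offset relative to the object center).
    """
    x_img, y_img, s_img = img_point
    x_occ, y_occ, s_occ = occurrence
    ratio = s_img / s_occ              # relative scale of the test patch
    x_vote = x_img - x_occ * ratio     # equation (4)
    y_vote = y_img - y_occ * ratio     # equation (5)
    s_vote = ratio                     # equation (6)
    return (x_vote, y_vote, s_vote)


# A patch found at twice the scale of its training occurrence:
vote = cast_vote((120.0, 80.0, 6.0), (10.0, -5.0, 3.0))
print(vote)  # (100.0, 90.0, 2.0)
```

The stored offset is stretched by the scale ratio before it is subtracted, so a patch seen at twice the training scale votes for a center twice as far away.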



However, the increased dimension of the voting space makes the maxima search com- 
putationally more expensive. For this reason, we employ a two-stage search strategy. In 
a first stage, votes are collected in a binned 3D Hough accumulator array in order to 
quickly find local maxima. Candidate maxima from this first stage are then refined in 
the second stage using the original (continuous) 3D votes. Instead of a simple but ex- 
pensive sliding-window technique, we formulate the search in a Mean-Shift framework. 
For this, we replace the simple search window W from equation (2) by the following 
kernel density estimate: 

$$\hat{p}(o_n, x) = \frac{1}{n h^d} \sum_k \sum_j p(o_n, x_j \mid e_k, \ell_k)\, K\!\left(\frac{x - x_j}{h}\right) \qquad (7)$$



where the kernel K is a radially symmetric, nonnegative function, centered at zero and 
integrating to one. From [5], we know that a Mean-Shift search using this formulation 
will quickly converge to local modes of the underlying distribution. Moreover, the search 
procedure can be interpreted as a Parzen window probability density estimation for the 
position of the object center. 

From the literature, it is also known that the performance of the Mean-Shift procedure 
depends critically on a good selection for the kernel bandwidth h. Various approaches 
have been proposed to estimate the optimal bandwidth directly from the data, e.g. [6,4]. 
In our case, however, we have an intuitive interpretation for the bandwidth as a search 
window for the position of the object center. As the object scale increases, the relative 
errors introduced by equations (4)-(6) cause votes to be spread over a larger area around 
the hypothesized object center and thus reduce their density in the voting space. As a 
consequence, the kernel bandwidth should also increase in order to compensate for this 
effect. We can thus make the bandwidth dependent on the scale coordinate and obtain 
the following balloon density estimator [6]: 



$$\hat{p}(o_n, x) = \frac{1}{n\, h(x)^d} \sum_k \sum_j p(o_n, x_j \mid e_k, \ell_k)\, K\!\left(\frac{x - x_j}{h(x)}\right) \qquad (8)$$



For K we use a uniform spherical kernel with a radius corresponding to 5% of the 
hypothesized object size. Since a certain minimum bandwidth needs to be maintained 
for small scales, though, we only adapt it for scales greater than 1.0. 
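
The scale-adaptive search can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: with a uniform kernel, one Mean-Shift step reduces to the weighted mean of all votes inside the current window. The reference object size `ref_size`, the factor `scale_axis_weight` that makes the scale axis commensurable with pixel coordinates, and the function names are hypothetical.

```python
import math


def bandwidth(scale, ref_size=100.0, fraction=0.05):
    """Kernel radius as a fraction of the hypothesized object size.

    The radius is only enlarged for scales above 1.0, so that a minimum
    bandwidth is kept for small scales.
    """
    return fraction * ref_size * max(scale, 1.0)


def mean_shift(votes, start, scale_axis_weight=50.0, max_iter=100, tol=1e-6):
    """Refine a candidate maximum in the (x, y, s) voting space.

    votes: list of (x, y, s, weight) tuples; start: (x, y, s).
    scale_axis_weight is an assumed factor that converts scale differences
    into the same units as pixel distances.
    """
    x, y, s = start
    for _ in range(max_iter):
        h = bandwidth(s)  # the window size depends on the current scale
        sw = sx = sy = ss = 0.0
        for vx, vy, vs, w in votes:
            d2 = ((vx - x) ** 2 + (vy - y) ** 2
                  + (scale_axis_weight * (vs - s)) ** 2)
            if d2 <= h * h:  # uniform spherical kernel
                sw += w
                sx += w * vx
                sy += w * vy
                ss += w * vs
        if sw == 0.0:  # no votes inside the window
            break
        nx, ny, ns = sx / sw, sy / sw, ss / sw
        shift = math.sqrt((nx - x) ** 2 + (ny - y) ** 2
                          + (scale_axis_weight * (ns - s)) ** 2)
        x, y, s = nx, ny, ns
        if shift < tol:
            break
    return (x, y, s)


votes = [(98.0, 90.0, 2.0, 1.0), (102.0, 90.0, 2.0, 1.0),
         (100.0, 88.0, 2.0, 1.0), (100.0, 92.0, 2.0, 1.0),
         (300.0, 300.0, 1.0, 1.0)]  # the last vote is an outlier
mode = mean_shift(votes, start=(103.0, 92.0, 2.0))
```

In this toy example the window at scale 2.0 has twice the radius it would have at scale 1.0, so all four clustered votes are collected in the first step and the search settles on their mean.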

We have thus formulated the multi-scale object detection problem as a scale-adaptive 
Mean-Shift search procedure. Our experimental results in Section 5 will show that this 

Fig. 1. Scale-invariant interest points found by (from left to right) the exact DoG, the fast DoG, the 
regular Harris-Laplace, and the fast Harris-Laplace detector on two example images. (The smallest 
scales are omitted in order to reduce clutter.) 

scale adaptation step is indeed needed in order to provide stable results over large scale 
changes. The performance of the resulting approach depends on the capability of the 
underlying patch extractor to find image structures that are both typical for the object 
category and that can be accurately localized in position and scale. As different detectors 
are optimized for finding different types of structures, the next section evaluates the 
suitability of various scale-invariant interest point detectors for categorization. 



4 Influence of Interest Point Detectors 

Typically, interest point detectors are only evaluated in terms of their repeatability. Con- 
sequently, significant effort has been spent on making the detectors discriminant enough 
that they find exactly the same structures again under different viewing conditions. How- 
ever, we strongly believe that the evaluation should be in the context of a task. In our 
case, the task is to recognize and localize previously unseen objects of a given category. 
This means that we cannot assume to find exactly the same structures again; instead the 
system needs to generalize and find structures that are similar enough to known object 
parts while still allowing enough flexibility to cope with variations. Also, because of the 
large intra-class variability, more potential matching candidates are needed to compen- 
sate for inevitable mismatches. Last but not least, the interest points should provide a 
sufficient cover of the object, so that it can be recognized even if some important parts 
are occluded. Altogether, this imposes a rather different set of constraints on the interest 
point detector. As a first step we therefore have to compare the performance of different 
interest point operators for the categorization task. 

In this work, we evaluate two different types of scale-invariant interest point oper- 
ators: the Harris-Laplace detector [15], and the DoG (Difference of Gaussian) detector 
[14]. Both operators have been shown to yield high repeatability [16], but they differ in 
the type of structures they respond to. The Harris-Laplace prefers corner-like structures 
by searching for multi-scale Harris points that are simultaneously extrema of a scale- 
space Laplacian, while the DoG detector selects blob-like structures by searching for 
scale-space maxima of a Difference-of-Gaussian. For both detectors, we additionally 
examine two variants: a regular and a speed-optimized implementation (operating on a 
Gaussian pyramid). Figure 1 shows the kind of structures that are captured by the differ- 
ent detectors. As can already be observed from these examples, all detectors manage to 

Fig. 2. Performance comparison on the UIUC database. (left) Precision-Recall curves of different 
interest point detectors for the single-scale case. (right) EER performance over scale changes 
relative to the size of the training examples. 



capture some characteristic object parts, such as the car’s wheels, but the range of scales 
and the distribution of points over the object varies considerably between them. 

In order to obtain a more quantitative assessment of their capabilities, we compare 
the different interest point operators on a car detection task using our extended approach. 
As a test set, we use the UIUC database [1], which consists of 170 images containing a 
total of 200 sideviews of cars. For all experiments reported below, training is done on a 
set of 50 hand-segmented images (mirrored to represent both car directions). In a first 
stage, we compare the recognition performance if the test images are of the same size as 
the training images. Since our detectors are learned at a higher resolution than the cars 
in the test set, we rescale all test images by the same factor prior to recognition. (Note 
that this step does not increase the images’ information content.) 

Figure 2(left) shows a comparison of the detectors’ performances. It can be seen that 
the single-scale Harris codebook from [11] achieves the best results with 97.5% equal 
error rate (EER). Compared to its performance, all scale-invariant detectors result in 
codebooks that are less discriminant. This could be expected, since invariance always 
comes at the price of reduced discriminance. However, the exact DoG detector reaches 
an EER performance of 91%, which still compares favorably to state-of-the-art methods 
(see Tab. 1). The fast DoG detector performs only slightly worse with 89% EER. In 
contrast, both Harris-Laplace variants are notably inferior with 59.5% for the regular 
and 70% for the speed-optimized version. 

The main reason for the poorer performance of the Harris-Laplace detectors is that 
they return a smaller absolute number of interest points on the object, so that a sufficient 
cover is not always guaranteed. Although previous studies have shown that the Harris- 
Laplace points are more discriminant individually [7], their smaller number is a strong 
disadvantage. The DoG detectors, on the other hand, both find enough points on the 
objects and are discriminant enough to allow reliable matches to the codebook. They 
are thus better suited for our categorization task. For this reason, we only consider DoG 
detectors in the following experiments. 

Fig. 3. (top) Visualization of the range of scales tested in the experiments (panels shown for scale 
factors 0.4, 1.0, and 2.2), and the corresponding car detections. Training has been performed at 
scale 1.0. (bottom) Segmentations automatically obtained for these examples (white: figure, 
black: ground, gray: not sampled). 



5 Robustness to Scale Changes 



We now analyze the robustness to scale changes. In particular, we are interested in the 
limit to the detectors’ performance when the scale of the test images is altered by a large 
factor and the fraction of familiar image structures is thus decreased. Rather than testing 
individual thresholds, we therefore compare the maximally achievable performance by 
looking at how the equal error rates are affected by scale changes. 
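
The text does not spell out how the equal error rate is computed. The following sketch assumes the common convention for precision-recall evaluation, namely the operating point at which precision and recall coincide; the function name and the toy data are hypothetical.

```python
def equal_error_rate(scores, is_correct, num_objects):
    """Performance at the point where precision and recall are closest.

    scores:      confidence of each detection (higher is better).
    is_correct:  whether each detection matches a ground-truth object.
    num_objects: number of ground-truth objects in the test set.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    best_gap, best_value = None, 0.0
    true_positives = 0
    for rank, i in enumerate(order, start=1):
        if is_correct[i]:
            true_positives += 1
        if true_positives == 0:
            continue  # precision and recall are both zero here
        precision = true_positives / rank
        recall = true_positives / num_objects
        gap = abs(precision - recall)
        if best_gap is None or gap < best_gap:
            best_gap, best_value = gap, (precision + recall) / 2
    return best_value


eer = equal_error_rate(
    scores=[0.9, 0.8, 0.7, 0.6, 0.5],
    is_correct=[True, True, False, True, False],
    num_objects=4,
)
print(eer)  # 0.75
```

Sweeping the detection threshold in this way yields one number per scale factor, which is what allows the curves for different detectors to be compared independently of any particular threshold.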

In the following experiment, the UIUC database images are rescaled to different 
sizes and the performance is measured as a function of the scaling factor relative to the 
size of the training examples. Figure 2(right) shows the EER performances that can be 
achieved for scale changes between factor 0.4 (corresponding to a scale reduction of 
1:2.5) and factor 2.2. When the training and test images are approximately of the same 
size, the single-scale Harris codebook is highly discriminant and provides the superior 
performance described in the previous section. However, the evaluation shows that it is 
only robust to scale changes up to about 20%, after which its performance quickly drops. 
The exact-DoG codebook, on the other hand, is not as discriminative and only achieves 
an EER of 91% for test images of the same scale. However, it is far more robust to scale 
changes and can compensate for both enlargements and size reductions of more than a 
factor of 2. Up to a scale factor of 0.6, its performance stays above 89%. Even when the 
target object is only half the size of those seen during training, it still provides an EER of 
85%. For the larger scales, the performance gradation is similar. The fast DoG detector 
performs about 10% worse, mainly because its implementation with a Gaussian pyramid 
restricts the number and precision of points found at higher scales. Figure 2(right) also 
shows that the system’s performance quickly degrades without the scale adaptation step 
from Section 3, confirming that this step is indeed important. 

An artifact of the interest point detectors can be observed when looking at the per- 
formance gradation over scale. Our implementation of the exact DoG detector estimates 
characteristic scale by computing three discrete levels per scale octave [14] and interpo- 
lates between them using a second-order polynomial. Correspondingly, recognition per- 
formance is highest at scale levels where structure sizes can be exactly computed (namely 
{0.6, 1.0, 1.3, 1.6, 2.0}, which correspond to powers of $\sqrt[3]{2}$). In-between those levels, 
the performance slightly dips. Although this effect can easily be alleviated by using more 
levels per scale octave, it shows the importance of this design decision. 
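
As a quick check of the scale levels mentioned above, the following lines enumerate the levels obtained with three levels per octave, assuming the levels are spaced geometrically by a factor of the cube root of 2 around the training scale 1.0. The variable names are illustrative only.

```python
levels_per_octave = 3
step = 2 ** (1 / levels_per_octave)  # cube root of 2, about 1.26

# Levels from two steps below to three steps above the training scale.
levels = [round(step ** k, 2) for k in range(-2, 4)]
print(levels)  # [0.63, 0.79, 1.0, 1.26, 1.59, 2.0]
```

Rounded to one decimal, these values reproduce the levels 0.6, 1.0, 1.3, 1.6, and 2.0 quoted in the text; the level near 0.8 follows from the same spacing but is not listed there.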

Figure 3 shows a visualization of the range of scales tested in this experiment. 
Our approach’s capability to provide robust performance over this large range of image 
variations marks a significant improvement over [11]. In the bottom part of the figure, the 
automatically generated segmentations are displayed for the different scales. Compared 
to the single-scale segmentations from [11], the segmentation quality is only slightly 
inferior, while being stable over a wide range of scales. 



6 Conclusion and Future Work 

In this paper, we have presented a scale-invariant extension of the approach from [12,11] 
that makes the method applicable in practice. By reformulating the multi-scale object 
detection problem in a Mean-Shift framework, we have obtained a theoretically founded 
interpretation of the hypothesis search procedure which allows us to use a principled scale 
adaptation mechanism. Our quantitative evaluation over a large range of scales shows 
that the resulting method is robust to scale changes of more than a factor of 2. In addition, 
the method retains the capability to provide an automatically derived object segmentation 
as part of the recognition process. 

As part of our study, we have also evaluated the suitability of different scale-invariant 
interest point detectors for the categorization task. One interesting result is that, while 
found to be more discriminant in previous studies [15,7], the Harris-Laplace detector 
on its own does not detect enough points on the object to enable reliable recognition. 
The DoG detector, on the other hand, both finds enough points on the object and is dis- 
criminant enough to yield good recognition performance. This emphasizes the different 
characteristics the object categorization task brings with it, compared to the identifi- 
cation of known objects, and the consequent need to reevaluate design decisions. An 
obvious extension would be to combine both Harris-type and DoG-type interest points 
in a common system. Since both detectors respond to different image structures, they 
can complement each other and compensate for missing detections. Consequently, we 
expect such a combination to be more robust than the individual detectors. 
Abstract. A fundamental problem in image recognition is to evaluate 
the similarity of two images. This can be done by searching for the best 
pixel-to-pixel matching taking into account suitable constraints. In this 
paper, we present an extension of a zero-order matching model called 
the image distortion model that yields state-of-the-art classification re- 
sults for different tasks. We include the constraint that in the matching 
process each pixel of both compared images must be matched at least 
once. The optimal matching under this constraint can be determined 
using the Hungarian algorithm. The additional constraint leads to more 
homogeneous displacement fields in the matching. The method reduces 
the error rate of a nearest neighbor classifier on the well known USPS 
handwritten digit recognition task from 2.4% to 2.2%. 



1 Introduction 

In image recognition, a common problem is to match two given images, e.g. 
when comparing an observed image to given references. In that process, different 
methods can be used. For this purpose we can define cost functions depending 
on the distortion introduced in the matching and search for the best matching 
with respect to a given cost function [6] . One successful and conceptually simple 
method for determining the image matching is to use a zero-order model that 
completely disregards dependencies between the pixel mappings. This model has 
been described in the literature several times independently and is called image 
distortion model (IDM) here. The IDM yields especially good results if the local 
image context for each pixel is considered in the matching process by using 
gradient information and local sub windows [5,6]. 

In this paper, we introduce an extension of the IDM that affects the pixel 
mapping not by incorporating explicit restrictions on the displacements (which 
can also lead to improvements [5,6]), but by adding the global constraint that 
each pixel in both of the compared images must be matched at least once. To 
find the best matching under this constraint, we construct an appropriate graph 
representing the images to be compared and then solve the ‘minimum weight 
edge cover’ problem that can be reduced to the ‘minimum weight matching’ 
problem. The latter can then be solved using the Hungarian algorithm [7]. The 
resulting model leads to more homogeneous displacement fields and improves 
the error rate for the recognition of handwritten digits. We refer to this model 
as the Hungarian distortion model (HDM). 

The HDM is evaluated on the well known US Postal Service database (USPS), 
which contains segmented handwritten digits from US zip codes. There are many 
results for different classifiers available on this database and the HDM approach 
presented here achieves an error rate of 2.2% which is - though not being the 
best known result - state-of-the-art and an improvement over the 2.4% error 
rate achieved using the IDM alone. 

Related work. There is a large amount of literature dealing with the ap- 
plication of graph matching to computer vision and pattern recognition tasks. 
For example, graph matching procedures can be used for labeling of segmented 
scenes. Other examples, more related to the discussed method include the fol- 
lowing: In [9] the authors represent face images by elastic graphs which have 
node labels representing the local texture information as computed by a set of 
Gabor filters and are used in the face localization and recognition process. In [1] 
a method for image matching using the Hungarian algorithm is described that 
is based on representations of the local image context called ‘shape contexts’ 
which are only extracted at edge points. An assignment between these points 
is determined using the Hungarian algorithm and the image is matched using 
thin-plate splines, which is iterated until convergence. Yet, all applications of 
graph matching to comparable tasks that are known to the authors operate on a 
level higher than the pixel level. The novelty of the presented approach therefore 
consists in applying the matching at the pixel level. 

2 Decision Rule and Image Matching 

In this work, we focus on the invariant distance resulting from the image match- 
ing process and therefore only use a simple classification approach. We briefly 
give a formal description of the decision process: To classify a test image $A$ with 
a given training set of references $B_{1k}, \ldots, B_{N_k k}$ for each class $k \in \{1, \ldots, K\}$ 
we use the nearest neighbor (NN) decision rule 

$$r(A) = \arg\min_k \left\{ \min_{n=1,\ldots,N_k} D(A, B_{nk}) \right\},$$

i.e. the test image is assigned to the class of the nearest reference image. For 
the distance calculation the test image $A = \{a_{ij}\}$, $i = 1, \ldots, I$, $j = 1, \ldots, J$ 
must be explained by a suitable deformation of the reference image 
$B = \{b_{xy}\}$, $x = 1, \ldots, X$, $y = 1, \ldots, Y$. Here, the image pixels take $U$-dimensional 
values $a_{ij}, b_{xy} \in \mathbb{R}^U$, where the vector components are denoted by a superscript 
u. It has been observed in previous experiments that the performance of defor- 
mation models is significantly improved by using local context at the level of the 
pixels [5,6]. For example, we can use the horizontal and vertical image gradient 
as computed by a Sobel filter and/or local sub images that represent the im- 
age context of a pixel. Furthermore, we can use appropriately weighted position 
features (e.g. . . . ) that describe the relative pixel position in order to 
assign higher costs to mappings that deviate strongly from a linear matching. 
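
The nearest neighbor decision rule above can be sketched in a few lines. The distance function is passed in as a parameter so that any deformation-based distance $D$ can be plugged in; the plain squared Euclidean distance used here is only a stand-in, and all names and the toy data are hypothetical.

```python
def nearest_neighbor_class(test_image, references, distance):
    """Assign the class of the closest reference image.

    references: dict mapping class label k to a list of reference images.
    distance:   function D(A, B) returning a non-negative number.
    """
    best_label, best_dist = None, float("inf")
    for label, refs in references.items():
        for ref in refs:
            d = distance(test_image, ref)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label


def squared_euclidean(a, b):
    """Stand-in distance for equally sized gray-value images (lists of rows)."""
    return sum((p - q) ** 2
               for row_a, row_b in zip(a, b)
               for p, q in zip(row_a, row_b))


references = {
    0: [[[0, 0], [0, 0]]],
    1: [[[1, 1], [1, 1]]],
}
label = nearest_neighbor_class([[1, 1], [0, 1]], references, squared_euclidean)
print(label)  # 1
```

Because the classifier itself is this simple, any change in error rate can be attributed to the distance measure, which is the quantity of interest in this work.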
We now want to determine an image deformation mapping $(x_{11}^{IJ}, y_{11}^{IJ}) : (i,j) \mapsto (x_{ij}, y_{ij})$ 
that results in the distorted reference image $B_{(x_{11}^{IJ}, y_{11}^{IJ})} = \{b_{x_{ij} y_{ij}}\}$. 
The resulting cost given the two images and the deformation mapping is defined as 

$$C\left(A, B, (x_{11}^{IJ}, y_{11}^{IJ})\right) = \sum_{i,j} \sum_u \left\| a_{ij}^u - b_{x_{ij} y_{ij}}^u \right\|^2,$$



i.e. by summing up the local pixel-wise distances, which are squared Euclidean 
distances here. Now, the distance measure between images A and B is determined 
by minimizing the cost over the possible deformation mappings: 



$$D(A, B) = \min_{(x_{11}^{IJ}, y_{11}^{IJ}) \in \mathcal{M}} \left\{ C\left(A, B, (x_{11}^{IJ}, y_{11}^{IJ})\right) \right\}$$



The set of possible deformation mappings $\mathcal{M}$ determines the type of model used. 
For the IDM these restrictions are $x_{ij} \in \{1, \ldots, X\} \cap \{i' - w, \ldots, i' + w\}$, $i' = \left[ i \frac{X}{I} \right]$, 
$y_{ij} \in \{1, \ldots, Y\} \cap \{j' - w, \ldots, j' + w\}$, $j' = \left[ j \frac{Y}{J} \right]$, with warp range $w$, e.g. 
$w = 2$. For different models, the minimization process can be computationally 
very complex. A preselection of, e.g., the 100 nearest neighbors with a different 
distance measure such as the Euclidean distance can then significantly reduce the 
computation time at the expense of a slightly higher error rate. 



2.1 Image Distortion Model 

The IDM is a conceptually very simple matching procedure. It neglects all de- 
pendencies between the pixel displacements and is therefore a zero-order model 
of distortion. Although higher order models have advantages in some cases, the 
IDM is chosen here for comparison to the HDM since the Hungarian algorithm 
does not easily support the inclusion of dependencies between pixel displacements. 
The formal restrictions of the IDM are given in the previous section; a 
more informal description is as follows: for each pixel in the test image, determine 
the best matching pixel within a region of size w × w at the corresponding 
position in the reference image and use this match. Due to its simplicity and 
efficiency this model has been introduced several times in the literature under 
differing names. When used with the appropriate pixel-level context description 
it produces very good classification results for object recognition tasks like 
handwritten digit recognition [6] and radiograph classification [5]. 
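
A minimal sketch of the IDM distance for gray-value images follows. It uses the formal restrictions from the previous section, so the search region spans offsets from -w to +w around the corresponding position. Simplifying assumptions: pixels are scalars (U = 1), indices are 0-based, and the corresponding position is obtained by integer scaling rather than the rounding used in the formal definition. The function name is hypothetical.

```python
def idm_distance(test, ref, w=2):
    """Zero-order image distortion model distance between two images.

    test, ref: images as lists of rows of numbers (sizes may differ).
    w:         warp range; each test pixel may match any reference pixel
               whose offset from the corresponding position is at most w.
    """
    I, J = len(test), len(test[0])
    X, Y = len(ref), len(ref[0])
    total = 0.0
    for i in range(I):
        ci = min(X - 1, (i * X) // I)  # corresponding row in the reference
        for j in range(J):
            cj = min(Y - 1, (j * Y) // J)  # corresponding column
            # Each pixel is matched independently of all other pixels.
            total += min(
                (test[i][j] - ref[x][y]) ** 2
                for x in range(max(0, ci - w), min(X, ci + w + 1))
                for y in range(max(0, cj - w), min(Y, cj + w + 1))
            )
    return total


test_img = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
ref_img = [[9, 0, 0], [0, 0, 0], [0, 0, 0]]  # bright pixel shifted diagonally
print(idm_distance(test_img, ref_img, w=0))  # 162.0
print(idm_distance(test_img, ref_img, w=1))  # 0.0
```

With w = 0 the distance is the plain squared Euclidean distance, while w = 1 already absorbs the one-pixel shift. Note that nothing prevents several test pixels from mapping to the same reference pixel, or some reference pixels from remaining unmatched.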

2.2 Hungarian Matching 

The term ‘matching’ is a well-known expression in graph theory, where it refers to 
a selection of edges in a (bipartite) graph. We can also view the concept of pixel- 
to-pixel image matchings in this context. To do so, we construct an appropriate 
graph from the two images to be compared and apply the suitable algorithms 
known from graph theory. In this section we explore this application and use 
the so called Hungarian algorithm to solve different pixel-to-pixel assignment 
problems for images. The Hungarian algorithm has been used before to assign 
image region descriptors of two images to each other [1]. 
Construction of the bipartite graph. The construction of the bipartite 
graph in the case discussed here is straightforward: Each pixel position of one 
of the two images to be compared is mapped to a node in the graph. Two nodes 
are connected by an edge if and only if they represent pixels from different 
images. This means that the two components of the bipartite graph represent 
the two images. The weight of an edge is chosen to be the Euclidean distance 
between the respective pixel representations, possibly enlarged by penalties for 
too large absolute distortions. 
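
The graph construction can be illustrated by building the matrix of edge weights between all pixels of two images. This is a sketch under assumptions: pixels are given as feature tuples, and the distortion penalty is taken to be linear in the pixel displacement, which is a hypothetical choice since the exact form of the penalty is not specified here.

```python
import math


def edge_weights(image_a, image_b, penalty=0.0):
    """Weight matrix of the complete bipartite graph between two images.

    image_a, image_b: lists of rows; each pixel is a tuple of feature values.
    penalty:          assumed factor applied to the pixel displacement.
    Row n of the result belongs to the n-th pixel of image_a (row-major
    order); column m belongs to the m-th pixel of image_b.
    """
    pixels_a = [(i, j, v) for i, row in enumerate(image_a)
                for j, v in enumerate(row)]
    pixels_b = [(x, y, v) for x, row in enumerate(image_b)
                for y, v in enumerate(row)]
    weights = []
    for i, j, va in pixels_a:
        row = []
        for x, y, vb in pixels_b:
            local = math.dist(va, vb)                # Euclidean feature distance
            displacement = math.hypot(i - x, j - y)  # distortion in pixels
            row.append(local + penalty * displacement)
        weights.append(row)
    return weights


image_a = [[(0.0,), (3.0,)]]
image_b = [[(0.0,), (4.0,)]]
print(edge_weights(image_a, image_b))               # [[0.0, 4.0], [3.0, 1.0]]
print(edge_weights(image_a, image_b, penalty=1.0))  # [[0.0, 5.0], [4.0, 1.0]]
```

For two images with N pixels each, the matrix has N × N entries, which is the input expected by the assignment algorithm outlined next.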

Outline of the Hungarian algorithm. This outline of the Hungarian algo- 
rithm is included for the interested reader but it is not essential for the under- 
standing of the proposed method. The outline follows [7, pp. 74-89], which was 
also the basis for the used implementation. The name ‘Hungarian’ algorithm is 
due to a constructive result published by two Hungarian mathematicians in 1931 
that is used in the algorithm [7, p. 78]. 

To explain the basic idea, we assume that the weights of the edges are given 
by the entries of a matrix W and we assume that both components of the graph 
have $N$ vertices and thus $W \in \mathbb{R}^{N \times N}$ is square. The goal of the algorithm is 
to find a permutation $\pi : \{1, \ldots, N\} \to \{1, \ldots, N\}$ minimizing $\sum_{n=1}^{N} W_{n\pi(n)}$. 
Now, we can make the following observations: 

(a) Adding a constant to any row or column of the matrix does not change the 
solution, because exactly one term in the sum is changed by that amount 
independent of the permutation. 

(b) If $W$ is nonnegative and $\sum_{n=1}^{N} W_{n\pi(n)} = 0$ then $\pi$ is a solution. 
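
Observation (a) can be checked on a small example by brute force. The sketch below subtracts the minimum from every row and then from every column and confirms that the optimal permutation is unchanged; the matrix and the helper names are illustrative only.

```python
from itertools import permutations


def best_permutation(W):
    """Exhaustively find the permutation minimizing sum_n W[n][pi(n)]."""
    n = len(W)
    return min(permutations(range(n)),
               key=lambda p: sum(W[i][p[i]] for i in range(n)))


def reduce_rows_and_columns(W):
    """Subtract each row's minimum, then each column's minimum."""
    rows = [[v - min(row) for v in row] for row in W]
    col_min = [min(col) for col in zip(*rows)]
    return [[v - m for v, m in zip(row, col_min)] for row in rows]


W = [[4, 1, 3],
     [2, 0, 5],
     [3, 2, 2]]
R = reduce_rows_and_columns(W)
print(R)                    # [[2, 0, 2], [1, 0, 5], [0, 0, 0]]
print(best_permutation(W))  # (1, 0, 2)
print(best_permutation(R))  # (1, 0, 2)
```

The reduced matrix is nonnegative, but its best permutation still has cost 1 rather than 0, so observation (b) does not yet apply. This is the situation in which further reductions of the matrix are needed.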

Let two zeroes in W be called independent if they appear in different rows 
and columns. The algorithm now uses the following ‘Hungarian’ theorem: The 
maximum number of mutually independent zeroes in W is equal to the minimum 
number of lines (rows or columns) that are needed to cover all zeroes in W. 
Given an algorithm that finds such a maximum set of mutually independent 
zeroes and the corresponding minimum set of lines (as summarized below) the 
complete algorithm can be formulated as follows: 

1. from each line (row or column) subtract its minimum element 

2. find a maximum set of N' mutually independent zeroes 

3. if N' = N such zeroes have been found: output their indices and stop; 
otherwise: cover all zeroes in W with N' lines and find the minimum uncovered 
value; subtract it from all uncovered elements, and add it to all doubly 
covered elements; go to 2 
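
Observation (a) and step 1 can be checked on a small example. The sketch below is not the Hungarian algorithm itself: it uses exhaustive search over all permutations (feasible only for tiny N) to show that subtracting row and column minima leaves the optimal permutation unchanged. The function names and the example matrix are illustrative assumptions.

```python
from itertools import permutations

def best_permutation(W):
    """Exhaustive search for the permutation pi minimizing sum_n W[n][pi[n]]."""
    n = len(W)
    return min(permutations(range(n)),
               key=lambda pi: sum(W[i][pi[i]] for i in range(n)))

def reduce_lines(W):
    """Step 1: subtract the minimum of each row, then of each column."""
    R = [[v - min(row) for v in row] for row in W]
    col_min = [min(col) for col in zip(*R)]
    return [[v - m for v, m in zip(row, col_min)] for row in R]

W = [[4, 1, 3],
     [2, 0, 5],
     [3, 2, 2]]
R = reduce_lines(W)
# The reduced matrix is nonnegative and has a zero in every row and column,
# yet its optimal permutation is the same as for W.
same_solution = best_permutation(W) == best_permutation(R)
```

After the reduction the remaining work of the algorithm is to find N independent zeroes, adjusting the matrix as in step 3 whenever fewer than N exist.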

To show that the algorithm always terminates and yields the correct result, 
it is necessary to illustrate how step 2 works. The detailed discussion of the 
termination is beyond the scope of this overview. We try to give a short idea 
and otherwise refer to [7]: 

1. Choose an initial set of independent zeroes (e.g. greedily constructed) and 
call these ‘special’. 

2. Cover rows containing one of the special zeroes and mark all other rows. 

3. While there are marked rows, choose the next marked row: for each zero in 
the row that is not in a covered column, two cases are possible: 
(a) the column already contains a special zero in another row p: cover the 
column and uncover and mark p; 
(b) a new special zero is found and processed. 
When the row is processed completely, unmark it. 

Termination of the algorithm is guaranteed, because in step 3 either the 
number of mutually independent zeroes or the number of covered columns is 
increased by the newly introduced zero and this can happen at most N times. 
The total running time of this algorithm is $O(N^3)$, where the average case can be 
much lower if good initial assignments can be determined. This implies that the 
application of the HDM to large images is only possible at a high computational 
cost. Note that there are other algorithms to solve the assignment problem, but 
most of these algorithms are developed for special cases of the structure of the 
graph (which is always a complete bipartite graph here). 
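
For readers who want to experiment, the following is a compact $O(N^3)$ solver for the square assignment problem. Note the assumption: it uses the common formulation with row and column potentials and shortest augmenting paths, which is a different but equivalent presentation from the zero-covering outline above, and it is not the implementation used for the experiments. The function name `hungarian` is illustrative.

```python
def hungarian(cost):
    """Return a list a with a[i] = column assigned to row i (minimum total cost)."""
    n = len(cost)
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (n + 1)   # column potentials (1-based, index 0 is a dummy)
    p = [0] * (n + 1)     # p[j] = row currently matched to column j (0 = free)
    way = [0] * (n + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift the potentials so that a new zero-reduced-cost edge appears.
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:      # reached a free column: augmenting path found
                break
        while j0:               # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment
```

Each of the N outer iterations adds one row to the matching in $O(N^2)$ time, which gives the cubic bound mentioned above.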



Application of the Hungarian algorithm. The Hungarian algorithm is a 
tool to solve an assignment problem. For image matching, we can determine the 
best matching of pixels onto each other, where each pixel is matched exactly 
once. It is possible to directly use the Hungarian algorithm, but in many cases 
it is more appropriate to match the pixels onto each other such that each pixel 
is matched at least once or such that each pixel of the test image is matched 
exactly once. This last case corresponds to the most frequently used setting. We 
then require that the reference image explains all the pixels in the test image. 
We thus have three applications of the Hungarian algorithm for image matching: 

Each pixel matched exactly once. This case is trivial. Construct the weight 
matrix as discussed above and apply the Hungarian algorithm to obtain a 
minimum weight matching. 

Each pixel matched at least once. For this case, we need to solve the ‘min- 
imum weight edge cover’ problem. A reduction to the exact match case can 
be done following an idea presented in [3]: 

1. construct the weight matrix as discussed above 
2. for each node find one of the incident edges with minimum weight 
3. subtract from each edge weight the minimum weight of both connected 
nodes as determined in the previous step 
4. make the edge weight matrix nonnegative (by subtracting the minimum 
weight) and apply the Hungarian algorithm 
5. from the resulting matching, remove all edges with a nonzero weight 
(their nodes are covered better by using the minimum weight incident edges) 
6. for each uncovered node add an edge with minimum weight to the cover 

Each pixel of the test image matched exactly once. This task is solved 
by the image distortion model: we only need to choose the best matching 
pixel for each pixel in the test image. 

Another method to obtain such a matching evolves from the previous algo- 
rithm if it is followed by the step: 7. for each pixel of the test image delete 
all edges in the cover except one with minimum weight. 

The resulting matching then does not have the overall minimum weight (as 
determined by the IDM) but respects larger parts of the reference image due 
to the construction of the matching. Therefore, the resulting matching is 
more homogeneous. In informal experiments this last choice showed the best 
performance and was used for the experiments presented in the following. 

Fig. 1. Examples of pixel displacements; left: image distortion model; right: Hungarian 
distortion model. Top to bottom: grey values, horizontal, and vertical gradient; left to 
right: test image, distorted reference image, displacement field, and original reference 
image. The matching is based on the gradient values alone, using 3x3 local sub images 
and an absolute warp range of 2 pixels. 
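
The ‘at least once’ case and the optional step 7 can be sketched as follows. Several assumptions apply: the reduced weights are clipped at zero instead of being shifted to nonnegative values, which is a common variant of the reduction and may differ in detail from [3]; an exhaustive search over permutations stands in for the Hungarian algorithm, so the code is only usable for tiny N; and all function names are hypothetical.

```python
from itertools import permutations

def min_weight_edge_cover(W):
    """Minimum weight edge cover of a complete bipartite graph, W[i][j] >= 0."""
    n = len(W)
    mu = [min(row) for row in W]          # cheapest incident edge per test pixel
    nu = [min(col) for col in zip(*W)]    # cheapest incident edge per reference pixel
    # Reduced weight of edge (i, j): cost of covering both end nodes with this
    # one edge instead of their cheapest edges. Clipping at 0 means that the
    # edge is not worth keeping in the matching.
    R = [[min(W[i][j] - mu[i] - nu[j], 0) for j in range(n)] for i in range(n)]
    pi = min(permutations(range(n)),
             key=lambda p: sum(R[i][p[i]] for i in range(n)))
    cover = {(i, pi[i]) for i in range(n) if R[i][pi[i]] < 0}
    rows_done = {i for i, _ in cover}
    cols_done = {j for _, j in cover}
    for i in range(n):                    # uncovered nodes take their cheapest edge
        if i not in rows_done:
            cover.add((i, W[i].index(mu[i])))
    for j in range(n):
        if j not in cols_done:
            column = [W[i][j] for i in range(n)]
            cover.add((column.index(nu[j]), j))
    return cover

def keep_best_per_test_pixel(W, cover):
    """Step 7: for each test pixel i keep only its cheapest edge in the cover."""
    best = {}
    for i, j in cover:
        if i not in best or W[i][j] < W[i][best[i]]:
            best[i] = j
    return best
```

The dictionary returned by `keep_best_per_test_pixel` maps every test pixel to exactly one reference pixel, which is the form of matching used in the experiments below.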



3 Experiments and Results 

The software used in the experiments is available for download at 
http://www-i6.informatik.rwth-aachen.de/~gollan/w2d.html. We performed 
experiments on the well-known US Postal Service handwritten digit recognition 
task (USPS). It contains normalized greyscale images of size 16x16 pixels 
of handwritten digits from US zip codes. The corpus is divided into 7,291 training 
and 2,007 test images. A human error rate estimated to be 1.5-2.5% shows 
that it is a hard recognition task. A large variety of classification algorithms 
have been tried on this database and some of the best results are summarized 
in Table 1. 

Table 1. Best reported recognition results for the USPS corpus (top: general results 
for comparison; bottom: results related to the discussed method) 

method                                        | reference | ER[%] 
----------------------------------------------+-----------+------ 
invariant support vector machine              | [8]       | 3.0 
extended tangent distance                     | [6]       | 2.4 
extended support vector machine               | [2]       | 2.2 
local features + tangent distance             | [4]       | 2.0 
ext. pseudo-2D HMM, local image context, 3-NN | [6]       | 1.9 
----------------------------------------------+-----------+------ 
no matching, 1-NN                             |           | 5.6 
IDM, local image context, 1-NN                | [6]       | 2.4 
HDM, local image context, 1-NN                | this work | 2.2 

Fig. 2. Error rates on USPS vs. position weight and sub image size using HDM with 
greyvalues (preselection: 100 nearest neighbors, Euclidean distance). 

Fig. 3. Error rates on USPS vs. position weight and sub image size using HDM with 
gradients (preselection: 100 nearest neighbors, Euclidean distance). 

Figure 1 shows two typical examples of pixel displacements resulting from 
IDM and HDM in the comparison of two images showing the digit ‘9’. It can be 
observed that the HDM leads to a significantly more homogeneous displacement 
field due to the additional restriction imposed in the calculation of the mapping. 

Figure 2 shows the error rate of the HDM with respect to the weight of 
the position feature in the matching process. The pixel features used are the 
greyvalue contexts of sizes 1x1, 3x3, 5x5, and 7x7, respectively. Interestingly, 
already using only pixel greyvalues (1x1), the error rate can be somewhat improved 
from 5.6% to 5.0% with the appropriate position weight. Best results are 
obtained using sub images of size 3x3 leading to 3.2% error rate. 

Figure 3 shows the error rate of the HDM with respect to the weight of the 
position feature using the vertical and horizontal gradient as the image features 
with different local contexts. Interestingly, the 1x1 error rate is very competitive 
when using the image gradient as features and reaches an error rate of 
2.7%. Again, best results are obtained using sub images of size 3x3 and position 
weights around 0.005 relative to the other features, with an error rate of 2.4%. 

All previously described experiments used a preselection of the 100 nearest 
neighbors with the Euclidean distance to speed up the classification process. 
(One image comparison takes about 0.1s on a 1.8GHz processor for 3x3 gradient 
contexts.) Using the full reference set in the classifier finally reduces the error 
rate from 2.4% to 2.2% for this setting. Note that this improvement is not 
statistically significant on a test corpus of size 2,007 but is still remarkable in 
combination with the resulting more homogeneous displacement fields. 
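
The remark on significance can be made concrete with a rough calculation. The sketch below assumes a normal approximation to the binomial distribution and a 95% level (z = 1.96); the text does not state which test was actually used, so this is only one plausible way to check the claim.

```python
import math

def error_rate_interval(error_rate, n, z=1.96):
    """Approximate confidence interval for an error rate measured on n samples."""
    half_width = z * math.sqrt(error_rate * (1.0 - error_rate) / n)
    return error_rate - half_width, error_rate + half_width

n_test = 2007
errors_idm = round(0.024 * n_test)   # number of misclassified test images at 2.4%
errors_hdm = round(0.022 * n_test)   # number of misclassified test images at 2.2%
low, high = error_rate_interval(0.022, n_test)
# The 2.4% result lies inside the interval around 2.2%, so a difference of a
# handful of test images cannot be distinguished from chance at this level.
overlap = low < 0.024 < high
```

Under these assumptions the interval around 2.2% is roughly 1.6% to 2.8%, which is consistent with the statement that the improvement is not statistically significant.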

4 Conclusion 

In this paper, we extended the image distortion model which leads to state-of- 
the-art results in different classification tasks when using an appropriate repre- 
sentation of the local image context. The extension uses the Hungarian algorithm 
to find the best pixel-to-pixel mapping with the additional constraint that each 
pixel in both compared images must be matched at least once. This constraint 
leads to more homogeneous displacement fields in the matching process. The 
error rate on the USPS handwritten digit recognition task could be reduced 
from 2.4% to 2.2% using a nearest neighbor classifier and the IDM and HDM as 
distance measures, respectively. 

Abstract. Features like junctions and corners are a rich source of information 
for image understanding. We present a novel theoretical framework 
for the analysis of such 2D features in scalar and multispectral 
images. We model the features as occluding superpositions of two different 
orientations and derive a new constraint equation based on the 
tensor product of two directional derivatives. The eigensystem analysis 
of a 3 x 3 tensor then provides the so-called mixed-orientation parameters 
(MOP) vector that encodes the two orientations uniquely, but only 
implicitly. We then show how to separate the MOP vector into the two 
orientations by finding the roots of a second-order polynomial. Based on 
the orientations, the occluding boundary and the center of the junction 
are easily determined. The results confirm the validity, robustness, and 
accuracy of the approach. 



1 Introduction 

It is well known that corners and junctions are a rich source of information for 
image understanding: T-junctions are associated to object occlusions; L- and Y- 
junctions to object corners; X-junctions to the occurrence of transparencies; and 
$\Psi$-junctions to the presence of bending surfaces of objects [2, 12]. Accordingly, 
different approaches for junction localization have been reported [9, 11, 13, 14, 
17, 19, 22]. 

In addition to the above semantic importance of junctions and corners, their 
significance is determined by basic properties of the image function itself: flat 
regions in images are the most frequent but also redundant; one-dimensional 
features like straight edges are still redundant since two-dimensional regions 
have been shown to fully specify an image [4, 15]; corners and junctions are the 
least frequent but most significant image features [23]. 

In this paper we model an image junction as a superposition of oriented 
structures and show how to estimate the multiple orientations occurring at such 
positions. Our approach differs from previous attempts [10, 18, 19] in that we 
provide a closed-form solution. Moreover, our results are an extension of earlier 
results that have dealt with the problems of estimating transparent motions [16, 
21], occluded motions [6,5], and multiple orientations in images based on an 
additive model [20, 1]. 

2 Theoretical Results 



Let f(x) be an image that is ideally oriented in a region $\Omega$, i.e., there is a direction 
(subspace) E of the plane such that f(x + v) = f(x) for all x, v such that 
$x, x + v \in \Omega$, $v \in E$. This is equivalent to 

$$\frac{\partial f(x)}{\partial v} = 0 \quad \text{for all } x \in \Omega \text{ and } v \in E, \qquad (1)$$

which is a system of q equations for $f(x) \in \mathbb{R}^q$. For intensity images q = 1 and 
for RGB images q = 3. The direction E can be estimated as the set of vectors 
that minimize the energy functional 

$$\mathcal{E}(v) = \int_\Omega \left| \frac{\partial f}{\partial v} \right|^2 d\Omega = v^T J v, \qquad (2)$$

where J is given by 

$$J = \int_\Omega \begin{pmatrix} |f_x|^2 & f_x \cdot f_y \\ f_x \cdot f_y & |f_y|^2 \end{pmatrix} d\Omega. \qquad (3)$$

In the above equation, $f_x$, $f_y$ are short notations for $\partial f/\partial x$, $\partial f/\partial y$. 

The tensor J is the natural generalization of the structure tensor [7-9, 11] 
for multi-spectral images. Since J is symmetric and non-negative, Eq. (1) is 
equivalent to $Jv = \lambda v$, $v \neq 0$, where ideally $\lambda = 0$. This implies that E is the 
null-eigenspace of J and in practice estimated as the eigenspace associated to the 
smallest eigenvalues of J. Confidence for the estimation can thus be derived from 
the eigenvalues (or, equivalently, scalar invariants) of J (see [16]): two small 
eigenvalues correspond to flat regions of the image; only one small eigenvalue 
to the presence of oriented structures; and two significant eigenvalues to the 
presence of junctions or other 2D structures. Below we show how to estimate 
the orientations at junctions where two oriented structures predominate. 
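
The single-orientation case can be illustrated with a few lines of code. This is a minimal sketch under assumptions: derivatives are given as one vector per pixel (one entry per spectral channel), the integral over the region is replaced by a plain sum, and the threshold `eps` in `classify` is a hypothetical parameter rather than the confidence measure of [16].

```python
import math

def structure_tensor(fx, fy):
    """Entries (Jxx, Jxy, Jyy) of J, summed over all pixels of a region.

    fx, fy: lists with one derivative vector per pixel (length q each).
    """
    jxx = sum(sum(a * a for a in px) for px in fx)
    jyy = sum(sum(b * b for b in py) for py in fy)
    jxy = sum(sum(a * b for a, b in zip(px, py)) for px, py in zip(fx, fy))
    return jxx, jxy, jyy

def eigenvalues(jxx, jxy, jyy):
    """Closed-form eigenvalues (smallest, largest) of [[jxx, jxy], [jxy, jyy]]."""
    half_trace = 0.5 * (jxx + jyy)
    root = math.hypot(0.5 * (jxx - jyy), jxy)
    return half_trace - root, half_trace + root

def orientation(jxx, jxy, jyy):
    """Unit vector v spanning the eigenspace of the smallest eigenvalue."""
    theta = 0.5 * math.atan2(2.0 * jxy, jxx - jyy)  # direction of largest variation
    return -math.sin(theta), math.cos(theta)

def classify(small, large, eps):
    """Assumed rule of thumb: count how many eigenvalues exceed eps."""
    if large < eps:
        return "flat"
    return "oriented" if small < eps else "2D structure"
```

For an image that varies only along x, the smaller eigenvalue vanishes and the estimated direction v points along y, in agreement with Eq. (1).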



2.1 Multiple Orientations 

Let $\Omega$ be a region of high confidence for a junction. We model junctions by the 
following constraint on f(x) that is the occluded superposition 

$$f(x) = \chi(x)\, g_1(x) + (1 - \chi(x))\, g_2(x), \qquad (4)$$

where $g_1(x)$, $g_2(x)$ are ideally oriented with directions $u = (u_x, u_y)^T$ and 
$v = (v_x, v_y)^T$ respectively; and where $\chi(x)$ is the characteristic function of some 
half-plane P through the ‘center’ (to be defined later) of the junction. This model is 
appropriate for the local description of junction types T, L and $\Psi$. X-junctions 
better fit a transparent model and have been treated in [1, 20]. 

The Constraint Equation. To estimate two orientations in $\Omega$, we observe 
that Eq. (4) is equivalent to 

$$f(x) = \begin{cases} g_1(x) & \text{if } x \in P \\ g_2(x) & \text{otherwise.} \end{cases} \qquad (5)$$

Therefore, $\partial f(x)/\partial u = 0$ if x is inside of P and $\partial f(x)/\partial v = 0$ if x is outside of 
P. From the above we can draw the important and, as we shall see, very useful 
conclusion that the expression 

$$\frac{\partial f(x)}{\partial u} \otimes \frac{\partial f(x)}{\partial v} = 0 \qquad (6)$$

is valid everywhere except for the border of P where it may differ from zero. 
The symbol $\otimes$ denotes the tensor product of two vectors. Eq. (6) may not hold 
at the border of P because there the derivatives of the characteristic function 
$\chi(x)$ are not defined. This is not the case if u and the border of P have the 
same direction, e.g., in case of a T-junction. Given Eq. (6), the tensor product 
should be symmetric in u and v. Since in practice symmetry might be violated, 
we expand the symmetric part of the above tensor product to obtain 

$$c_{xx}\, f_x \otimes f_x + \frac{c_{xy}}{2} \left( f_x \otimes f_y + f_y \otimes f_x \right) + c_{yy}\, f_y \otimes f_y = 0 \qquad (7)$$

where 

$$c_{xx} = u_x v_x, \quad c_{yy} = u_y v_y, \quad c_{xy} = u_x v_y + u_y v_x. \qquad (8)$$

Note that for an image with q spectral components, the system in Eq. (7) has 
q(q + 1)/2 equations, which makes the system over-constrained if q > 2. The 
vector $c = (c_{xx}, c_{xy}, c_{yy})^T$ is the so-called mixed orientation parameters vector 
and is an implicit representation of the two orientations. 



Estimation of the Mixed Orientation Parameters. An estimator of the 
mixed orientation parameters is obtained by a least-squares procedure that finds 
the minimal points of the energy functional 

$$\mathcal{E}_2(c) = \int_\Omega \left\| c_{xx}\, f_x \otimes f_x + \frac{c_{xy}}{2} \left( f_x \otimes f_y + f_y \otimes f_x \right) + c_{yy}\, f_y \otimes f_y \right\|^2 d\Omega = c^T J_2\, c, \qquad (9)$$

i.e., c is estimated as an eigenvector $\hat{c}$ associated to the smallest eigenvalue of 
the 3 x 3 tensor $J_2$ that represents this quadratic form. 

The actual region of integration can be kept smaller for multi-spectral images 
since the system in Eq. (7) is over-constrained in this case. However, the estimator 
$\hat{c}$ represents two orientations only if it is consistent with Eq. (8). This is 
the case if and only if $c_{xy}^2 - 4 c_{xx} c_{yy} > 0$. 

Separation of the Orientations. To separate the orientations it suffices to 
know the matrix 

$$\begin{pmatrix} u_x v_x & u_x v_y \\ u_y v_x & u_y v_y \end{pmatrix} \qquad (11)$$

because its rows represent one orientation and its columns the other, cf. [1] for 
the case of transparency. Since we already know that $c_{xx} = u_x v_x$ and $c_{yy} = u_y v_y$, 
we need only to obtain $z_1 = u_x v_y$ and $z_2 = u_y v_x$. To this end, observe that 
$z_1 + z_2 = c_{xy}$ and $z_1 z_2 = c_{xx} c_{yy}$. Therefore, $z_1$, $z_2$ are the roots of 

$$Q_2(z) = z^2 - c_{xy} z + c_{xx} c_{yy}. \qquad (12)$$
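
The separation step can be written down directly from Eq. (12). The sketch below is an illustration with assumed names: it solves the quadratic, rebuilds the matrix of Eq. (11) and reads one orientation from the row with the larger norm and the other from the column with the larger norm. Orientations are returned as angles modulo $\pi$, since the MOP vector is only determined up to scale and sign.

```python
import math

def _angle(x, y):
    """Orientation angle in [0, pi) of the line spanned by (x, y)."""
    return math.atan2(y, x) % math.pi

def separate_orientations(cxx, cxy, cyy):
    """Return the two orientation angles encoded in c, or None if inconsistent."""
    disc = cxy * cxy - 4.0 * cxx * cyy
    if disc < 0.0:
        return None                      # c does not represent two orientations
    root = math.sqrt(disc)
    z1 = 0.5 * (cxy + root)              # roots of z^2 - cxy z + cxx cyy
    z2 = 0.5 * (cxy - root)
    rows = [(cxx, z1), (z2, cyy)]        # each row is a multiple of one direction
    cols = [(cxx, z2), (z1, cyy)]        # each column is a multiple of the other
    first = max(rows, key=lambda r: math.hypot(*r))
    second = max(cols, key=lambda c: math.hypot(*c))
    return _angle(*first), _angle(*second)
```

Which of the two roots is called $z_1$ only decides which orientation is returned first, so the result should be treated as an unordered pair.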



Pruning of the Orientation Fields. After the separation, each point of a 
junction neighborhood has two directions assigned to it, see Fig. 1 (b,e) and Fig. 
2 (b,e). Since only one of these is correct, we need to prune the other. For this, 
we observe that at each position only one of the equations 

$$\partial f(x)/\partial u = 0, \qquad \partial f(x)/\partial v = 0 \qquad (13)$$

is valid. To prune the wrong vector at a given position p, we first compute the 
local histogram of the orientations in a small neighborhood (3x3 pixels) of 
p and separate the two orientations by the median. We then assign to p the 
correct direction depending on which equation in (13) is better satisfied in the 
sense that the sum of squares is lowest. This is equivalent to a procedure that 
would choose the direction of smallest variation of the image f (x) . 



Junction Localization. Since measures of confidence only give us a region 
where multiple orientations can occur, it is useful to have a method for deciding 
which point in this region is actually the center of the junction. We follow the 
approach in [9] for the localization of the junction. Let $\Omega$ represent a region of 
high confidence for a junction. For an ideal junction located at p we have 

$$df_x\, (x - p) = 0 \qquad (14)$$

where $df_x$ is the Jacobian matrix of f(x). The center of the junction is therefore 
defined and estimated as the minimal point of 

$$\int_\Omega \left| df_x\, (x - p) \right|^2 d\Omega \qquad (15)$$

which gives 

$$p = J^{-1} b, \quad \text{where } b = \int_\Omega df_x^T\, df_x\, x \; d\Omega. \qquad (16)$$

Fig. 1. Synthetic example: panel (a) depicts a sinusoidal pattern, (b) the estimated 
orientations for the marked region and (c) the orientations after pruning. Real example: 
panel (d) shows a picture of a house; (e) and (f) are analogous to (b) and (c) above. 
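
The localization rule of Eq. (16) reduces to a 2 x 2 linear system. The following sketch assumes a scalar image (q = 1), so that the Jacobian at each pixel is just the gradient, and replaces the integral by a sum over sample points; the function name and the tolerance on the determinant are illustrative choices.

```python
def locate_junction(points, gradients):
    """Least-squares junction center p = J^{-1} b for a scalar image.

    points:    list of pixel positions (x, y) in the region
    gradients: list of gradients (fx, fy) at those positions
    """
    jxx = jxy = jyy = bx = by = 0.0
    for (x, y), (gx, gy) in zip(points, gradients):
        axx, axy, ayy = gx * gx, gx * gy, gy * gy   # entries of df^T df
        jxx += axx
        jxy += axy
        jyy += ayy
        bx += axx * x + axy * y                      # b accumulates df^T df x
        by += axy * x + ayy * y
    det = jxx * jyy - jxy * jxy
    if abs(det) < 1e-12:
        return None      # only one orientation present: the center is undefined
    return ((jyy * bx - jxy * by) / det,
            (jxx * by - jxy * bx) / det)
```

If all gradients in the region are parallel, J is singular and the function returns None, which mirrors the fact that a junction needs at least two orientations.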



3 Results 

Fig. 1 depicts the results of the estimation of single and double orientations in a 
synthetic and a natural image. In panel (a) two oriented sinusoidal patterns were 
combined to form T-junctions along the main diagonal of the image to which 
Gaussian white noise was added (SNR of 25 dB). The estimated orientations 
for the selected region in (a) are shown in panel (b). Note that in a region 
around the T-junction two orientations are estimated at each pixel. Panel (c) 
shows the result obtained after the pruning process. Note that the occluding 
boundary and the orientations on both sides of the boundary are well estimated. 
Panel (d) depicts a natural image with many oriented regions and junctions. The 
estimated orientations for the selected region in (d) are shown in panel (e) . The 
orientations after the pruning process are depicted in panel (f). 

Fig. 2 presents results for T-junctions of different angles. Panel (a) depicts 
the letter ‘A’ (image with additive Gaussian noise, SNR of 25 dB). In panel (d) 
a segmentation of the ‘A’ in terms of the number of estimated orientations is 
shown: white for no orientation, black for one, and gray for two orientations. Note 
that, around all corners of the letter, two orientations are found. The estimated 
orientations for the upper-left corner of the ‘A’ are shown in panels (b) (before 
pruning) and (c) (after pruning). Panel (e) depicts the estimated orientations 
for the corner of the ‘A’ with the smallest angle. Pixels with two orientations are 
then used according to Eq. (16) to locate the corner position. The result is indicated 
by the cross in panel (f) which also shows the corresponding orientations at that 
corner location. 

Fig. 2. Panel (a) shows the ‘A’-letter input image. The estimated orientations before 
and after pruning are shown in (b) and (c) respectively for the upper left corner of the 
‘A’. Panel (d) depicts the segmentation in terms of the number of observed orientations 
(see text), (e) the estimated orientations for the corner with the smallest angle, and (f) 
indicates the corner location estimated according to Eq. (16) and the two orientations 
at that location. 

In all examples, we first search for at least one single orientation. If confidence 
for at least one orientation is low, we search for double orientations. Confidence 
is based on the invariants of $J_N$ according to [16]. Thus, for $J_1$ the confidence 
criteria are $H > \epsilon$ and $\sqrt{K} < c_1 H$. For $J_2$, the confidence criterion for two 
orientations is $\sqrt[3]{K} < c_2 \sqrt{S}$. The numbers H, K, and S are the invariants, i.e., 
the trace, the determinant, and the sum of the diagonal minors of $J_{1,2}$. For 
the examples in Fig. 1, we used an integration window size of 11 x 11 pixels, 
$c_1 = 0.5$, $c_2 = 0.6$, and $\epsilon = 0.001$. For the example in Fig. 2, we used an 
integration window size of 7 x 7 pixels, $c_1 = 0.4$, $c_2 = 0.6$, and $\epsilon = 0.01$. The 
above parameter settings have been found experimentally. Derivatives were taken 
with a $[-1, 0, 1]^T [1, 1, 1]$ kernel in x- and analogously in y-direction. 

4 Conclusions 

We have presented a new and accurate method for the estimation of two orienta- 
tions at image features that satisfy an occlusion model. Typical features are cor- 
ners and various kinds of junctions that occur frequently in natural images. The 
method only involves first-order derivatives and has closed-form solutions. Itera- 
tive procedures are not involved, unless one chooses to estimate the eigenvectors 
of a 3 x 3 tensor iteratively. This can be avoided as shown in [3]. Experiments 
on synthetic and real images show that junctions and corners are well described 
by the proposed model and confirm the accuracy of the method. Nevertheless, 
our results can be further improved. An obvious improvement is the use of op- 
timized derivative kernels. Derivatives could also be replaced by more general 
filters as a consequence of results obtained in [16]. A straightforward extension 
of our approach may allow for the estimation of more than two orientations. 

We have formulated our results such as to include multi-spectral images in 
a natural but non-trivial way. If q is the number of colors, the constraint that 
we use consists of q(q + 1)/2 equations. This implies that for only two colors 
we already have a well conditioned system and can use even smaller neighbor- 
hoods for the estimation of two orientations. Forthcoming results will show the 
additional advantages when applying our method to multi-spectral images. 

The benefits of using corners and junctions for image analysis, registration, 
tracking etc. have often been highlighted. The estimation of the orientations 
that form these features may add further robustness and new kinds of invariant 
features. It might, for example, be easier to register a junction in terms of its 
orientations since the orientations will change less than the appearance and other 
features of the junction. The orientations seem especially useful as they can now 
be well estimated with low computational effort. 

Abstract. In this paper, we present an approach for image reconstruction 
from local phase vectors in the monogenic scale space. The local 
phase vector contains not only the local phase but also the local orientation 
of the original signal, which enables the simultaneous estimation of 
the structural and geometric information. Consequently, the local phase 
vector preserves a lot of important information of the original signal. 
Image reconstruction from the local phase vectors can be easily and quickly 
implemented in the monogenic scale space in a coarse-to-fine way. 
Experimental results illustrate that an image can be accurately reconstructed 
based on the local phase vector. In contrast to the reconstruction from 
zero crossings, our approach proves to be stable. Due to the local 
orientation adaptivity of the local phase vector, the presented approach 
gives a better result when compared with that of the Gabor phase based 
reconstruction. 



1 Introduction 

In the past decades, signal reconstruction from partial information has been 
an active area of research. Partial information such as zero crossings, Fourier 
magnitude and localized phase is considered to represent important features of 
the original signal. Therefore, we are able to reconstruct the original signal based 
on only the partial information. The variety of results on signal reconstruction 
has had a major impact on research fields such as image processing, communication 
and geophysics. 

Reconstruction from zero crossings in the scale space is investigated by Hummel 
[1]. He has demonstrated that reconstruction based on zero crossings is possible 
but can be unstable, unless gradient values along the zero crossings are 
added. In [2], it is proved that many features of the original image are clearly 
identifiable in the phase only image but not in the magnitude only image, and 
reconstruction from Fourier phase is visually satisfying. However, the application 
of this approach is rather limited in practice due to the computational complexity. 
Behar et al. have stated in [3] that image reconstruction from localized phase 
only information is more efficient and faster than that from the global phase. 
The reconstruction errors produced by this method can be very small. However, 
compared with this approach, the way of image reconstruction presented in this 
paper is easier and faster. 

C.E. Rasmussen et al. (Eds.): DAGM 2004, LNCS 3175, pp. 171-178, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 
D. Zang and G. Sommer 

In this paper, we present an approach to image reconstruction from the local 
phase vector in the monogenic scale space. Image reconstruction is easy, fast, 
accurate and stable when compared with the above-mentioned approaches. In 
[4], Felsberg and Sommer proposed the first rotationally invariant 2D analytical 
signal. As one of its features, the monogenic phase vector preserves most of the 
important information of the original signal. The local phase vector contains not 
only the local phase but also the orientation information of the original signal, 
which enables the evaluation of structure and geometric information at the same 
time. The embedding of local phase and local orientation into monogenic scale 
space improves the stability and robustness. However, in the Gaussian scale 
space, there is no common filter set which could evaluate the local orientation and 
local phase simultaneously. To show the advantage of our approach, we replace 
the Gaussian kernel with the Gabor filter for phase evaluation; the reconstruction 
results of these two approaches are compared in this paper. 

2 The Monogenic Scale Space 

The structure of the monogenic scale space [5] is illustrated in Fig. 1. The 
Riesz transform of an image yields the corresponding figure flow, a vector field 
representing the Riesz transformed results. If we define $u = (u_1, u_2)^T$ and 
$x = (x_1, x_2)^T$, then the Riesz kernel in the frequency domain reads 
$H(u) = -i\, u / |u|$ and the convolution mask of the Riesz transform is given by 
$h(x) = x / (2\pi |x|^3)$. The combination of the signal and its Riesz transformed 
result is defined as the monogenic signal. Let f(x) represent the input signal; the 
corresponding monogenic signal thus takes the form $f_M(x) = f(x) + (h * f)(x)$. 
The monogenic signal is a vector-valued extension of the analytical signal and is 
rotation invariant. The monogenic scale space is built by the monogenic signals 
at all scales; it can alternatively be regarded as the combination of the Poisson 
scale space and its harmonic conjugate, which are obtained as follows: 

$$p(x; s) = (f * P)(x) \quad \text{where} \quad P(x) = \frac{s}{2\pi (|x|^2 + s^2)^{3/2}} \qquad (1)$$

$$q(x; s) = (f * Q)(x) \quad \text{where} \quad Q(x) = \frac{x}{2\pi (|x|^2 + s^2)^{3/2}} \qquad (2)$$

In the above formulas, P and Q indicate the Poisson kernel and the conjugate 
Poisson kernel, respectively. At scale zero, the conjugate Poisson kernel is exactly 
the Riesz kernel. The Poisson scale space p(x; s) is obtained from the original 
image by Poisson filtering; its harmonic conjugate is the conjugate Poisson scale 
space q(x; s), which can be formed by the figure flows at all scales. The unique 
advantage of the monogenic scale space, compared with the Gaussian scale space, 
is the figure flow being in quadrature phase relation to the image at each scale. 
Therefore, the monogenic scale space is superior to the Gaussian scale space if 
a quadrature relation concept is required. 

Fig. 1. The structure of the Monogenic Scale Space [5] 



3 Important Features of the Monogenic Scale Space 



As an analytical scale space, the monogenic scale space provides very useful 
signal features including local amplitude and local phase vector. The local phase 
vector contains both the local phase and local orientation information, which 
enables the simultaneous estimation of structural and geometric information. 
The local amplitude represents the local intensity or dynamics, the local phase 
indicates the local symmetry and the local orientation describes the direction of 
highest signal variance. Let p(x; s) and q(x; s) represent the Poisson scale space 
and its harmonic conjugate; the logarithm of the local amplitude, namely the 
local attenuation in the monogenic scale space, reads: 

$$A(x; s) = \log\left( \sqrt{|q(x; s)|^2 + (p(x; s))^2} \right) = \frac{1}{2} \log\left( |q(x; s)|^2 + (p(x; s))^2 \right) \qquad (3)$$

The local orientation and the local phase are best represented in a combined 
form, namely, the local phase vector r(x; s). It is defined as follows: 

$$r(x; s) = \frac{q(x; s)}{|q(x; s)|} \arctan\left( \frac{|q(x; s)|}{p(x; s)} \right) \qquad (4)$$


Whenever an explicit representation of phase or orientation is needed, the 
local orientation can be extracted from r as the orientation of the latter, and 
the local phase is obtained by projecting r onto the local orientation. The local 
phase vector thus denotes a rotation by the phase angle around an axis perpen- 
dicular to the local orientation. In the monogenic scale space, the local phase and 
orientation information are scale dependent, which means the local phase and 
orientation information can be correctly estimated at an arbitrary scale 
simultaneously. Unlike the monogenic scale space, there is no common filter set in the 
Gaussian scale space which enables the estimation of phase and orientation at 
the same time. The evaluation of phase in that traditional framework is possible 
when the Gaussian kernel is replaced by the Gabor filter, and by using Gaussian 
derivatives, orientation can be evaluated. However, the Gabor filter and Gaussian 
derivatives are not compatible in the Gaussian scale space; the phase and 
orientation obtained from them are simply a collection of features, and these two 
features cannot be evaluated simultaneously in the Gaussian framework. 
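
The features of Eqs. (3) and (4) are simple to compute once the Poisson-filtered value p and the figure flow q are available at a pixel. The sketch below is an illustration with assumed names; it uses `atan2` instead of a plain arctangent so that the phase stays well defined for p <= 0, which is an implementation choice and not part of the definition above. The filtering that produces p and q is not shown.

```python
import math

def local_features(p, q):
    """Local attenuation and local phase vector at one pixel and one scale.

    p: scalar value of the Poisson scale space
    q: pair (q1, q2) of the conjugate Poisson scale space (figure flow)
    Requires that p and q are not both zero.
    """
    q_norm = math.hypot(q[0], q[1])
    attenuation = 0.5 * math.log(q_norm ** 2 + p ** 2)     # Eq. (3)
    phase = math.atan2(q_norm, p)                           # local phase in [0, pi]
    if q_norm == 0.0:
        return attenuation, (0.0, 0.0)   # orientation undefined where q vanishes
    r = (q[0] / q_norm * phase, q[1] / q_norm * phase)      # Eq. (4)
    return attenuation, r

def real_part(attenuation, r):
    """Recover p from amplitude and phase: exp(A) * cos(|r|)."""
    return math.exp(attenuation) * math.cos(math.hypot(r[0], r[1]))
```

The helper `real_part` shows why amplitude and phase together determine the signal: the product of the local amplitude and the cosine of the local phase returns the original filtered value p.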



4 Image Reconstruction in the Monogenic Scale Space 

It is reported in [6] that the local attenuation and the phase response of a 
minimum phase filter form a Hilbert pair. Under certain conditions, this could 
also be generalized to 2D. For a 2D signal with an intrinsic dimension of one, if 
the scale space representation has no zeros in the half space with s > 0, then 
the local attenuation and the local phase vector form a Riesz triplet [5] 

$$r(x; s) \approx (h * A)(x; s) \tag{5}$$

where h refers to the Riesz kernel. In practice, images are in general not globally
intrinsically 1D signals. However, they commonly have many intrinsically 1D
neighborhoods, which makes the reconstruction from the local phase vector possible.
In most practical applications zeros occur in the positive half-space, but as
shown in [5], the influence of the zeros can mostly be neglected. To recover
the amplitude information from only the phase vector information, we take
the inverse Riesz transform of the local phase vector. By definition, the Riesz
transform of the local phase vector is DC free, i.e. the transformed
output has no DC component. Consequently, the DC-free local attenuation in
the scale space is approximated by

$$A(x; s) - \bar{A}(x; s) \approx -(h * r)(x; s) \tag{6}$$

where $\bar{A}(x; s)$ indicates the DC component of the local attenuation, which should
be calculated beforehand. Hence, the reconstruction of the original image based on the
local phase vector reads

$$f(x) = \exp(\bar{A}(x; 0)) \exp(-(h * r)(x; 0)) \cos(|r(x; 0)|) + C_{DC} \tag{7}$$

where $C_{DC}$ denotes a further DC correction term corresponding to a gray value
shift. To reconstruct a real image, we use only the real part of the local phase
vector, cos(|r(x; 0)|). This shows that image reconstruction
from the local phase vector can be implemented easily and quickly; no iterative
procedure is needed.
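
The Riesz transform that appears in Eqs. (5) to (7) is usually applied in the frequency domain. The sketch below is one possible implementation with NumPy's FFT; the sign convention H(u) = -i u/|u| and the function name are assumptions of this example, since the text does not fix them.

```python
import numpy as np

def riesz_transform(f):
    """Return the two Riesz components (q1, q2) of a 2D array f.

    Assumed convention: frequency response H(u) = -1j * u / |u|, with zero DC response.
    """
    rows, cols = f.shape
    u1 = np.fft.fftfreq(cols)[None, :]   # horizontal frequencies
    u2 = np.fft.fftfreq(rows)[:, None]   # vertical frequencies
    mag = np.sqrt(u1**2 + u2**2)
    mag[0, 0] = 1.0                      # avoid 0/0; the numerator is 0 at DC anyway
    F = np.fft.fft2(f)
    q1 = np.real(np.fft.ifft2(-1j * u1 / mag * F))
    q2 = np.real(np.fft.ifft2(-1j * u2 / mag * F))
    return q1, q2
```

Under this convention a horizontal cosine wave is mapped to the corresponding sine wave in the first component, while the second component vanishes.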

To investigate the image reconstruction in the monogenic scale space, a scale
pyramid structure is employed. The differences of monogenic signals at adjacent
scales are first computed as the bandpass decomposition at different frequencies
in the monogenic scale space. The information of different bandpasses forms a
Laplacian pyramid. Local phase vectors of the corresponding bandpass information
are considered as the partial information. Signals can thus be reconstructed
in the scale space in a coarse-to-fine way. Let $g^{(s)}$ denote the representation of
the image in the pyramid at scale s; then the representation one scale higher
reads $g^{(s+1)}$. By interpolation, $g^{(s+1)}$ is expanded as $\tilde{g}^{(s+1)} = T_I g^{(s+1)}$, where
$T_I$ refers to the operation of interpolation and $\tilde{g}^{(s+1)}$ has the same size as $g^{(s)}$.
The difference of adjacent scales can then be computed as

$$l^{(s)} = g^{(s)} - \tilde{g}^{(s+1)} = g^{(s)} - T_I g^{(s+1)} \tag{8}$$

where $l^{(s)}$ can be regarded as a bandpass decomposition of the original image.
Based on only the local phase vector of the intermediate representation, the
reconstruction at different scales can be implemented as follows

$$\tilde{l}^{(s)} = \exp(\bar{A}(x; s)) \exp(-(h * r)(x; s)) \cos(|r(x; s)|) + C_{DC} \tag{9}$$

where $\tilde{l}^{(s)}$ describes the reconstructed result at a certain scale. By means of a
coarse-to-fine approach, all the scale space images can be combined
to make the final reconstruction of the original image. Starting from the
coarsest level, the recovery of the image one scale lower takes the following form

$$\tilde{g}^{(s)} = \tilde{l}^{(s)} + T_I \tilde{g}^{(s+1)} \tag{10}$$

This step is repeated until s reaches zero; hence, $\tilde{g}^{(0)}$ indicates
the final reconstruction.
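
The pyramid bookkeeping of Eqs. (8) and (10) can be sketched as follows. This is only an illustration under simplifying assumptions: the coarser levels are produced by 2x2 block averaging rather than by the Poisson scale space, the interpolation operator is nearest-neighbour upsampling, and the bandpass layers are passed on unchanged instead of being reconstructed from their phase vectors via Eq. (9). All function names are hypothetical.

```python
import numpy as np

def shrink(g):
    """Assumed reduction step: halve the resolution by 2x2 block averaging."""
    h, w = g.shape
    return g.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def expand(g):
    """Assumed interpolation operator: nearest-neighbour upsampling by 2."""
    return np.kron(g, np.ones((2, 2)))

def decompose(image, levels):
    """Bandpass layers l^(s) = g^(s) - expand(g^(s+1)) as in Eq. (8), plus the coarsest level."""
    g = [np.asarray(image, dtype=float)]
    for _ in range(levels):
        g.append(shrink(g[-1]))
    bands = [g[s] - expand(g[s + 1]) for s in range(levels)]
    return bands, g[-1]

def collapse(bands, coarsest):
    """Coarse-to-fine recombination g^(s) = l^(s) + expand(g^(s+1)) as in Eq. (10)."""
    g = coarsest
    for band in reversed(bands):
        g = band + expand(g)
    return g
```

Because each bandpass layer stores exactly what the interpolation misses, collapsing an unmodified pyramid returns the input image; in the method described above, the reconstruction error therefore stems from the phase-based estimates of the individual layers.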




Fig. 2. Test images, from left to right: lena, bird, and circles (synthetic image)



5 Experimental Results 

In this section, we present some experiments to check the performance of image
reconstruction based on the local phase vector in the monogenic scale space.
The three images used for the experiments are shown in Fig. 2. Image reconstruction
in the monogenic scale space is illustrated in Fig. 3. Although we use pyramid
structures for scale space reconstruction, the results shown in Fig. 3 at different
scales are scaled to the same size as the original one. The top row shows the
original image and the corresponding absolute error multiplied by 8. The bottom
row shows the reconstructed results in the monogenic scale space. The
left image in the bottom row is the final result, which is reconstructed in a
coarse-to-fine way. The final reconstruction has a normalized mean square error
(NMSE) of 0.0018 when compared with the original one. This demonstrates that
image reconstruction can be implemented accurately from the local phase vector.
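
The text does not spell out how the NMSE is normalized. A common choice, assumed here purely for illustration, divides the summed squared error by the energy of the original image:

```python
import numpy as np

def nmse(original, reconstructed):
    """Normalized mean square error; normalizing by the energy of the original is an assumption."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return np.sum((original - reconstructed) ** 2) / np.sum(original ** 2)
```

Under this definition a perfect reconstruction yields 0 and an all-zero reconstruction yields 1.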




Fig. 3. Image reconstruction in the monogenic scale space. The original image and the 
absolute error image multiplied by 8 are shown in the top row. Bottom row: right three 
images demonstrate the intermediate reconstruction in the monogenic scale space, the 
left image indicates the final result. 



A successful reconstruction from partial information requires a stable output.
To investigate the performance of reconstruction from the local phase vector, we
conduct another experiment by adding noise to contaminate the input images
and checking the outputs. In this experiment, the bird image and the lena image
are used as the noise-contaminated inputs; the outcomes are shown in Fig. 4. The
NMSEs increase when the signal-to-noise ratio (SNR) is reduced. However, in
both cases, our approach results in limited reconstruction errors even when the SNR
is set to zero. The results indicate that reconstruction based on the local phase
vector is a stable process; hence, the local phase vector can be regarded as a stable
representation of the original signal. In contrast to this, reconstruction from only
zero crossings has been shown to produce unstable results [1] unless the gradient data
along the zero crossings are combined for reconstruction.

There is no common filter set in the Gaussian framework to evaluate the
phase and orientation simultaneously. However, phase information can be estimated
when the Gaussian kernel is replaced by the Gabor filter. To show the
advantage of our approach, we compare the results of our method with those of
the Gabor-phase-based approach. A certain orientation must be assigned to the Gabor
filter beforehand. In this case, the orientation is independent of the scale:
the local orientation estimate does not change when the scale is changed.
In contrast to the Gabor phase, the monogenic phase vector enables the estimation
of structural and geometric information simultaneously at each scale.

Fig. 4. Normalized mean square error with respect to signal-to-noise ratio






Fig. 5. Upper row: from left to right, the original image and the reconstructed
results based on Gabor phases with orientations of 45° and 135°; the corresponding
NMSEs are 0.0833 and 0.0836. Bottom row: the left image shows the reconstruction
based on the monogenic phase, which has an NMSE of 0.0014. The middle and
right images are the results from Gabor phases with orientations of 0° and 90°; the
NMSEs are 0.0812 and 0.0815, respectively.



In the monogenic scale space, the local phase vector and the local attenuation form a
Riesz triplet, which means that the amplitude can be easily recovered from the
local phase vector simply by using the inverse Riesz transform. Unfortunately, the
Gabor phase and the local amplitude do not have the property of orthogonality.
Therefore, we have to employ an iterative algorithm to reconstruct the image
based on local Gabor phases. The iterative reconstruction procedure is similar to
the Gerchberg-Saxton algorithm [7]. By alternately imposing constraints in the
spatial and frequency domains, an image can be reconstructed in an iterative
way. The comparison results are illustrated in Fig. 5. Four channels with orientations
of 0°, 45°, 90° and 135° are considered; the corresponding normalized mean
square errors are 0.0812, 0.0833, 0.0815 and 0.0836. The Gabor phase
only preserves the information at the given orientation, whereas the monogenic
phase results in an accurate and isotropic outcome with an NMSE of 0.0014.
Due to the rotation invariance of the monogenic signal, signals can be
well reconstructed in an isotropic way.

6 Conclusions 

In this paper, we have presented an approach to reconstruct an image in the
monogenic scale space based on the local phase vector. According to the estimated
local structural and geometric information, an image can be easily and
quickly reconstructed in the monogenic scale space in a coarse-to-fine way.
Experimental results show that accurate reconstruction is achievable. In contrast to
the reconstruction from zero crossings, a stable reconstruction can be achieved
based on the local phase vector. Furthermore, the property of local orientation
adaptivity can result in a much better reconstruction when compared
with that of the orientation selective Gabor phase.

Abstract. Recent findings in biological neuroscience suggest that the brain learns
body movements as sequences of motor primitives. Simultaneously, this principle
is gaining popularity in robotics, computer graphics and computer vision: movement
primitives were successfully applied to robotic control tasks as well as to render
or to recognize human behavior. In this paper, we demonstrate that movement
primitives can also be applied to the problem of implementing lifelike computer
game characters. We present an approach to behavior modeling and learning that
integrates several pattern recognition and machine learning techniques: trained
with data from recorded multiplayer computer games, neural gas networks learn
topological representations of virtual worlds; PCA is used to identify elementary
movements the human players repeatedly executed during a match; and complex
behaviors are represented as probability functions mapping movement primitives
to locations in the game environment. Experimental results underline that this
framework produces game characters with humanlike skills.



1 Motivation and Overview 



Computer games have become an enormous business; just recently, their annual sales
figures even surpassed those of the global film industry [3]. While it seems fair to say
that this success boosted developments in fields like computer graphics and networking,
commercial game programming and modern artificial intelligence or pattern recognition
hardly influenced each other. However, this situation is about to change. On the one hand,
the game industry is beginning to fathom the potential of pattern recognition and machine
learning to produce life-like artificial characters. On the other hand, the AI and pattern
recognition communities and even roboticists are discovering computer games as a testbed for
behavior learning and action recognition (cf. e.g. [1,2,10,13]).

This paper belongs to the latter category. Following an idea discussed in [2], we 
report on analyzing the network traffic of multiplayer games in order to realize game 
agents that show human-like movement skills. From a computer game perspective, this 
is an interesting problem because many games require the player to navigate through 
virtual worlds (also called maps). Practical experience shows that skilled human players 
do this more efficiently than their computer controlled counterparts. They make use of
shortcuts or perform movements which artificial agents cannot perform simply because
their programmers did not think of them¹.

An intuitive idea to address this problem is to identify elementary building blocks of 
movements and to learn how they have to be sequenced in order to produce the desired 
complex movement behavior. In fact, as recent results from biological neuroscience 
suggest, this seems to be the way the human brain constructs body movements [7,16]. 
This observation is substantiated by psychological experiments on imitation learning 
which suggest that infants devote much of their time to the learning of elementary limb 
movements and of how to combine them to reach a certain goal [14]. Not surprisingly, 
movement primitives are thus becoming popular in robotics and computer vision, too. 
Schaal et al. [15] describe how nonlinear differential equations may be used as dynamic 
movement primitives in skill learning for humanoid robots. Given motion sensor data, 
Mataric et al. [4] apply PCA and linearly superimpose resulting elementary movements 
to reproduce arm movements demonstrated by human subjects. Using spatio-temporal 
morphable models, Ilg and Giese [9] and Giese et al. [8] linearly combine primitive 
movements extracted from data recorded with motion capture systems to produce high 
quality renderings of karate moves and facial expressions, respectively. Galata et al. [6] 
apply variable length Markov models trained with vectors describing elementary 2D or 
3D motions to synthesize or recognize complex activities of humans. 

However, the preconditions for moving a character through a virtual world differ 
from the ones in the cited contributions. While these focus on movements of limbs or 
facial muscles and neglect the environment, a game character moves as a whole and 
appropriate movement sequences will depend on its current location and surroundings. 
For instance, movements leading the agent into virtual perils such as abysses or seas
of lava would be fatal and must be avoided. We are thus in need of a methodology
that, in addition to elementary moves, also learns a representation of the environment and
generates movement sequences with respect to the current spatial context. As the next
section will show, neural gases are well suited to learn topological representations of 
game environments. In section 3, we shall discuss the extraction of movement primitives 
from network traffic of multiplayer games and how to relate them to the neural gas 
representation of the environment. Section 4 will present experiments carried out within 
this framework and a conclusion will close this contribution. 



2 Neural Gas for Learning Topological Representations 

For a representation of the virtual world we learn a topology using a Neural
Gas [12] algorithm, a clustering algorithm that performs well when it
comes to topology learning [11]. The training data used for Neural Gas learning consists
of all locations p = (x, y, z) a human player visited during various plan executions,
thus staying very close to the actual human movement paths. Applying a Neural
Gas algorithm to the player's positions results in a number of prototypical positions,
which are interconnected only by the player's actions. Thereby non-reachable (or at
times unimportant) world positions are excluded in advance. In theory any point in the
gaming world could be reached by using game exploits such as the rocket jump; therefore
topology learning reveals a more accurate discretization of the 3D gaming world.

1 An example is the rocket jump in id Software's (in)famous game Quake II. Shortly after the
game was released, players discovered that they can jump higher if they make use of the recoil
of a fired rocket, i.e. if they fired a rocket to the ground immediately after they jumped, their
avatar was able to reach heights never planned by the programmers.

Fig. 1. A simple map and its corresponding topological representation

Since the sufficient number of cluster centres varies among maps, we also used
the Growing Neural Gas [5] to determine the number of cluster centres needed to reach a given
error value. Thereby we can choose a proportional number of cluster centres for each map,
once a suitable error value is found. The experiments were carried out using a mean
squared error value of 0.006 and resulted in 800 to 1400 cluster centres for larger maps,
and about 100 to 200 centres for smaller maps.

For our approach the interconnections between nodes are not needed; therefore we
can safely skip the edge learning. Figure 1 shows a small map and its corresponding
topological representation. The edges were drawn for clarity only; they are of no
further use in the presented approach.

By assigning training samples of recorded player actions to cluster centres in the 
topological map, small sets of localized training samples are generated. Each separated 
training set defines the legal actions for a specific region, not only in the topological map, 
but also in the simulated 3D world. However, for a further investigation of the player’s 
behavior we first have to introduce movement and action primitives. 
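
To make the topology learning step more concrete, here is a minimal sketch of the basic Neural Gas update with a fixed number of units, followed by the assignment of samples to their nearest unit. The parameter schedules, the initialization near the data mean and all function names are assumptions chosen for this example; the Growing Neural Gas variant used to select the number of centres is not covered.

```python
import numpy as np

def neural_gas(data, n_units, n_steps=5000, eps=(0.5, 0.005), lam=None, seed=0):
    """Basic Neural Gas: rank-based soft competitive learning of prototype positions."""
    rng = np.random.default_rng(seed)
    if lam is None:
        lam = (n_units / 2.0, 0.01)          # assumed neighbourhood range schedule
    # Assumed initialization: all units start close to the data mean.
    units = data.mean(axis=0) + 0.01 * rng.standard_normal((n_units, data.shape[1]))
    for t in range(n_steps):
        frac = t / n_steps
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac   # decaying step size
        lam_t = lam[0] * (lam[1] / lam[0]) ** frac   # decaying neighbourhood range
        x = data[rng.integers(len(data))]
        dists = np.linalg.norm(units - x, axis=1)
        ranks = np.argsort(np.argsort(dists))        # 0 for the closest unit
        units += eps_t * np.exp(-ranks / lam_t)[:, None] * (x - units)
    return units

def assign_to_units(data, units):
    """Index of the nearest unit (topological node) for every sample."""
    d2 = ((data[:, None, :] - units[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def quantization_error(data, units):
    """Mean squared distance between samples and their nearest unit."""
    d2 = ((data[:, None, :] - units[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()
```

The labels returned by `assign_to_units` are what split the recorded player actions into the small, localized training sets mentioned above.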

3 Movement Primitive Extraction 

Evidence from neuroscience indicates that complex movements in humans and animals are
built up by combinations of simpler motor or movement primitives [7]. For a more life-like
appearance of computer game character motions, a biological approach utilizing
movement primitives seems promising. To identify the underlying set of basic movements,
PCA is applied to the training samples. A training sample set consists of a number
of eight-dimensional motion vectors

t = [yaw angle, pitch angle, ..., velocity forward, player fire]

completely defining an action the player executed (a motion vector corresponds to a
human player's mouse and keyboard inputs). These motion vectors can be sent directly
to a server; no further mapping onto agent motor commands is needed.

The resulting eigenvectors e provide the elementary movements from which a motion
vector can be constructed. No dimension reduction is applied at this stage; PCA is just
used for computing an optimal representation. Thereby a projection of the observed motion
vectors onto the eigenvectors can be viewed as a reconstruction of player movements
by their movement primitives.
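
A minimal sketch of this step, assuming the motion vectors are stacked as rows of a NumPy array (the function names are hypothetical):

```python
import numpy as np

def eigenmovements(samples):
    """PCA without dimension reduction: mean, eigenvalues and eigenvectors (as columns)."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
    return mean, eigvals[order], eigvecs[:, order]

def project(samples, mean, eigvecs):
    """Coefficients of the motion vectors in the eigenmovement basis."""
    return (samples - mean) @ eigvecs

def reconstruct(coeffs, mean, eigvecs):
    """Motion vectors rebuilt from their eigenmovement coefficients."""
    return coeffs @ eigvecs.T + mean
```

Since all eight eigenvectors are kept, projecting and reconstructing reproduces the motion vectors up to rounding error.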

Frequently executed player movements can be grouped to gain action primitives,
which are of course built up using movement primitives. To acquire a set of action
primitives, the training samples are projected onto the eigenmovement space, in which
they are clustered using a k-means algorithm (similar to the primitive derivation in [4]).
This results in a set of cluster centers, each of which represents a single action primitive.
The number of action primitives does have an influence on the overall smoothness of the
later motion sequences; we achieved good results by choosing 500 to 700 cluster centers.
The right number of cluster centres depends on the number of training samples and on
the variety of observed motions. However, even the representation of a large number of
training samples could not be further improved by choosing a higher number of cluster
centres. This indicates that there might be a fixed number of action primitives which
guarantees a smooth execution of motion sequences.
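
The clustering step could be sketched with a plain Lloyd-style k-means as below. The initialization from randomly chosen samples, the fixed iteration count and the function name are assumptions of this example; the text does not state which k-means variant was used.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm: returns the cluster centers (action primitives) and sample labels."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct samples picked at random.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):                  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers, labels
```

Here `points` would be the motion vectors after projection onto the eigenmovement space, and `labels` gives the action primitive assigned to each recorded sample.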

Sequentially executed action primitives lead to complex behaviors (in fact all human 
behaviors can be interpreted as sequences of action primitives). In order to generate 
human-like motions, the action primitives need to be executed in a convincing manner, 
based on the training set motion vectors. 

Since the actual player's movement depends on the surroundings and on his position on
the map, the probability of executing a specific action primitive $v_i$ can be denoted as

$$P_{v_i} = P(v_i \mid w_k) \tag{1}$$

where $w_k$ denotes a certain node in the topological map. The acquisition of the conditional
probabilities is fairly easy: each movement vector can be assigned to a node in the
topological representation and to an action primitive. Counting the
occurrences of action primitives in all nodes results in an m × n matrix, where m denotes the
number of nodes in the topological map and n denotes the number of action primitives.
A matrix entry at position (k, i) denotes the probability $P(v_i \mid w_k)$ of executing an action
primitive $v_i$ at node number k.
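
The counting step can be sketched as follows, assuming each training sample already carries the index of its node and of its action primitive. Normalizing each row so that it sums to one, and leaving unvisited nodes as all-zero rows, are assumptions of this example.

```python
import numpy as np

def primitive_given_node(node_ids, primitive_ids, n_nodes, n_primitives):
    """Estimate the m x n matrix with entries P(v_i | w_k) from co-occurrence counts."""
    counts = np.zeros((n_nodes, n_primitives))
    np.add.at(counts, (node_ids, primitive_ids), 1.0)    # one count per training sample
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of nodes without any sample stay zero instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```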

However, in a sequence of action primitives, not every primitive can be executed as a
successor of any other primitive. On the one hand, humans tend to move in a smooth way, at least
compared to what would be possible for an artificial player; on the other hand, humans
are bound to the physical limitations of their hand motion. Besides, some players might
have certain habits: they tend to jump for no reason or make other kinds of useless, yet
very human movements. To reflect those aspects, the probability of executing a primitive
$v_i$ as a successor of a primitive $v_l$ should be incorporated. It can be denoted as

$$P_{v_i} = P(v_i \mid v_l) \tag{2}$$

Fig. 2. The artificial game character, observed while executing one of the experiment movements
- a jump to an otherwise not reachable item



The probabilities can be extracted from the training samples by inspecting the observed
action primitive sequence, resulting in an n × n transition matrix, where n denotes the
number of action primitives.

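Assuming the conditional probabilities $P(v_i \mid v_l)$ and $P(v_i \mid w_k)$ are independent, the
overall probability for the execution of a primitive $v_i$ can now be denoted as

$$P_{v_i} = \frac{P(v_i \mid v_l, w_k)}{\sum_{u=1}^{n} P(v_u \mid v_l, w_k)} = \frac{P(v_i \mid v_l)\, P(v_i \mid w_k)}{\sum_{u=1}^{n} P(v_u \mid v_l)\, P(v_u \mid w_k)} \tag{3}$$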

More conditional probabilities, expressing a greater variety of dependencies, could
be incorporated at this point, for example action primitive selection based on an enemy
player's relative position or on the current internal state of the player. In the
presented approach we wanted to concentrate on movements for handling environmental
difficulties, while still creating the impression of a human player, and therefore ignored
these further possibilities.

When placed in the game world, the next action for the artificial game character is
chosen randomly using a roulette wheel selection according to the probabilities $P_{v_i}$.
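
The selection mechanism can be sketched in three small steps: estimating the transition matrix from an observed primitive sequence, combining it with the node-dependent probabilities as a normalized product, and drawing the next primitive by roulette wheel selection. Function names and the error raised when no primitive has non-zero probability are assumptions of this example.

```python
import numpy as np

def transition_matrix(sequence, n_primitives):
    """Estimate P(v_i | v_l) from consecutive pairs in an observed primitive sequence."""
    counts = np.zeros((n_primitives, n_primitives))
    np.add.at(counts, (sequence[:-1], sequence[1:]), 1.0)
    sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, sums, out=np.zeros_like(counts), where=sums > 0)

def combined_probability(p_given_prev, p_given_node):
    """Normalized product of P(v_i | v_l) and P(v_i | w_k) over all primitives."""
    product = np.asarray(p_given_prev) * np.asarray(p_given_node)
    total = product.sum()
    if total == 0:
        raise ValueError("no primitive has non-zero probability in this context")
    return product / total

def roulette_wheel(probabilities, rng):
    """Draw an index with probability proportional to its entry."""
    cumulative = np.cumsum(probabilities)
    spin = rng.random() * cumulative[-1]
    index = int(np.searchsorted(cumulative, spin, side="right"))
    return min(index, len(cumulative) - 1)   # guard against rounding at the upper end
```

At run time, the row of the transition matrix for the last executed primitive and the row of the node matrix for the current node would be passed to `combined_probability`, and the result to `roulette_wheel`.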

4 Experiments 

To test the presented approach, we carried out a set of eight smaller experiments. Each 
experiment consisted of a separate training sample set, in which (at times) complicated 
movement sequences were executed several times by a human player. The observable 
motions varied from simple ground movements to more complex jump or shooting 
maneuvers (or combinations of both). In addition, a larger training sample set, this time 
a real match between two human players, was used. 

In all but one experiment the observed movement could be reproduced in a convincing
manner. In one experiment the artificial player usually ended up in front of a wall and
stopped moving. More training samples, containing hints for mastering the described
situation, could have led to a smarter impression and a better recreation of the observed
motion.

Although staying very close to the training set movements, an exact reproduction
almost never occurred because of the randomness in sequence generation. While this
might seem like a disadvantage, it definitely adds to a more human-like impression.
Figure 3 and Figure 4 show the trajectories of the artificial player compared to a human
test player in 3D.

Fig. 3. Comparison of the artificial player's (blue) movement trajectories in 3D and the trajectories
of a human player (red). The left picture shows a combination of different jumps to reach
an item located on a platform. The right image shows a repeated jump through a window.

Besides the basic reconstruction of certain motions, the artificial player's motions
themselves, his way of turning, jumping and running, looked natural, creating
the illusion of a human player (this applies to its motion, not the tactical/strategic
decisions). Even complicated actions, for example the famous rocket jump (an impossible
maneuver for an inexperienced human player), could be learned and executed.

The more realistic training set, two human players competing on a smaller map,
finally resulted in a very good imitation of a broader repertoire of a human player's
motion, indicating that our approach is suitable for larger scale problems. The strong
coupling between the game character's position in the topological representation and
the selection of motion primitives made the character appear smart, by acting in a way
appropriate to the architecture of the 3D game world - jumping over cliffs or standing
still when using an elevator. In addition, the approach preserved certain player habits by
(in one case) executing senseless jumps from time to time. Since no information about
the enemy player was introduced during live play, some kind of shadow fighting could
be observed, as if an enemy were present.

Integration with already developed approaches [17] for strategic movement
remains to be established. The human motion generation does not prevent the agent from
occasionally walking in circles or making other human-looking but nevertheless stupid
movements; after all, the approach focuses on realistic motions, and goal-oriented
movements were not intended.



5 Conclusion and Future Work 

In order to create life-like motion for computer game characters, we opted for a
biologically inspired approach using movement primitives as the basic building blocks
of motion. PCA applied to observed human motions revealed the eigenmoves, or in our
case movement primitives. Prototypical motion vectors were extracted using k-means
clustering in the eigenspace of the training samples used. Finally, conditional probabilities
of the execution of a specific motion primitive were computed, dependent on the position
in a topological map and on the last executed action primitive. When playing, the artificial
player selects its next action based on the conditional probabilities, thus favoring more
common sequences of action. Indeed, our experiments show a good performance
in reproducing even complicated training movement sequences. The artificial player's
motion appears surprisingly realistic and preserves certain human habits, which of course
adds to a life-like impression.

Fig. 4. Comparison of the artificial player's (blue) movement trajectories in 3D and the trajectories
of a human player (red). The experiment underlying the left image contained an elevator ride
and a long jump to a platform, while the right image displays the trajectories extracted from the
experiment illustrated in Figure 2.

Besides the topological map position, there might be other features on which motion 
primitive selection may depend. For example, an enemy player’s movement might be of 
importance. Integration of such conditional probabilities should be possible and could 
provide improvements. Besides a further development of the described approach, an
integration into available approaches for strategic and reactive imitation of human players
would be desirable. After all, life-like motion for computer game characters is only
one aspect of the imitation of a human player, though one of great importance. It is also
of interest whether our approach could be applied to other domains in the field of action
recognition/simulation for humans as well as for animals. First ideas have already been
discussed with biologists, who are investigating sequences in the courtship behavior of 
zebra finches. 

Acknowledgments. This work was supported by the German Research Foundation

Abstract. The well-known and very simple MinOver algorithm is reformulated
for incremental support vector classification with and without kernels. A modified
proof for its $O(t^{-1/2})$ convergence is presented, with t as the number of training
steps. Based on this modified proof it is shown that even a convergence of at
least $O(t^{-1})$ is given. This new convergence bound for MinOver is confirmed by
computer experiments on artificial data sets. The computational effort per training
step scales as O(N) with the number N of training patterns.



1 Introduction 

The Support-Vector-Machine (SVM) [1], [12] is an extremely successful concept for
pattern classification and regression and has found widespread applications (see, e.g. 
[6], [9], [11]). It became a standard tool like Neural Networks or classical approaches. 
A major drawback, particularly for industrial applications where easy and robust im- 
plementation is an issue, is its complicated training procedure. A large Quadratic- 
Programming problem has to be solved, which requires numerical optimization routines 
which many users do not want or cannot implement by themselves. They have to rely 
on existing software packages which are hardly comprehensible and, in many cases at
least, not error-free. This is in contrast to most Neural Network approaches where learning
has to be simple and incremental almost by definition. 

For this reason a number of different approaches to obtain more or less simple and 
incremental SVM training procedures have been introduced [2], [3], [10], [4], [7]. We 
will revisit the MinOver algorithm which was introduced by Krauth and Mezard [5] 
for spin-glass models of Neural Networks. As a slight modification of the perceptron 
algorithm, it is well-known that MinOver can be used for maximum margin classifi- 
cation. In spite of the fact that a training procedure can hardly be more simple, and in 
spite of the fact that advantageous learning behaviour has been reported [8], so far it has 
not become a standard training algorithm for maximum margin classification. To make 
MinOver more attractive we give a simplified formulation of this algorithm and show 
that, in contrast to the $O(t^{-1/2})$ convergence bound given in [5], in fact one can expect
an $O(t^{-1})$ convergence, with t as the number of learning steps.

1.1 The Problem

Given a linearly separable set of patterns $x_\nu \in \mathbb{R}^D$, $\nu = 1, \ldots, N$ with corresponding
class labels $y_\nu \in \{-1, 1\}$. We want to find the hyperplane which separates the patterns
of these two classes with maximum margin. The hyperplane for classification is determined
by its normal vector $w \in \mathbb{R}^D$ and its offset $b \in \mathbb{R}$. It achieves a separation of the
two classes, if

$$y_\nu (w^T x_\nu - b) > 0 \quad \text{for all } \nu = 1, \ldots, N$$

is valid. The margin $\Delta$ of this separation is given by

$$\Delta = \min_\nu \left[ y_\nu (w^T x_\nu - b) / \|w\| \right].$$

For convenience we introduce $z_\nu = y_\nu (x_\nu, -1) \in \mathbb{R}^{D+1}$ and $v = (w, b) \in \mathbb{R}^{D+1}$.
We look for the v which maximizes $\Delta(v) = \min_\nu [v^T z_\nu / \|w\|]$. With

$$d(v) = \min_\nu \left[ v^T z_\nu / \|v\| \right]$$

we introduce the margin of separation of the augmented patterns $(x_\nu, -1)$ in the
(D + 1)-space. The v which provides the maximum margin $d^*$ in the (D + 1)-space also
provides the maximum margin $\Delta^*$ in the D-dimensional subspace of the original patterns
$x_\nu \in \mathbb{R}^D$. This is the case since (i) the $v^*$ which provides $\Delta^*$ also provides at least
a local maximum of d(v) and (ii) d(v) and $\Delta(v)$ are convex and both have only one
global maximum. Therefore,

$$v^* = (w^*, b^*) = \arg\max_{\|v\|=1} \left[ \min_\nu (v^T z_\nu) / \|w\| \right] = \arg\max_{\|v\|=1} \left[ \min_\nu (v^T z_\nu) \right]$$

is valid. Instead of looking for the $v^*$ which provides the maximum $\Delta$, we look for the
$v^*$ which provides the maximum d. Both $v^*$ are identical. Since $\|v^*\|^2 = \|w^*\|^2 + (b^*)^2 = 1$,
we obtain $\Delta^*$ from $d^*$ and $v^* = (w^*, b^*)$ through

$$\Delta^* = \frac{d^*}{\|w^*\|} = \frac{d^*}{\sqrt{1 - (b^*)^2}}$$
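
As a small numerical illustration of the two margins, the following sketch evaluates $\Delta(v)$ and d(v) for a given hyperplane, using the augmented patterns $z_\nu = y_\nu (x_\nu, -1)$. The function name and the toy data in the usage note are assumptions of this example.

```python
import numpy as np

def margins(w, b, X, y):
    """Return (Delta, d): the margin in input space and in the augmented (D+1)-space."""
    Z = y[:, None] * np.hstack([X, -np.ones((len(X), 1))])   # rows are z_nu
    v = np.append(w, b)
    min_overlap = (Z @ v).min()
    return min_overlap / np.linalg.norm(w), min_overlap / np.linalg.norm(v)
```

For any v the two values are related by $\Delta(v)\,\|w\| = d(v)\,\|v\|$, which reduces to the relation between $\Delta^*$ and $d^*$ above when $\|v\| = 1$.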



2 The MinOver Algorithm Reformulated 

The well-known MinOver algorithm is a simple and iterative procedure which provides
the maximum margin classification in linearly separable classification problems. It was
introduced in [5] for spin-glass models of Neural Networks. The MinOver algorithm
yields a vector $v_t$ the direction of which converges to $v_*$ with increasing number
of iterations $t$. This is valid as long as a full separation, i.e. a $v_*$ with $\Delta_* > 0$, exists.

The MinOver algorithm works like the perceptron algorithm, with the slight modification
that for training always the pattern $z_{\alpha(t)}$ out of the training set $T = \{z_\nu \mid \nu = 1, \ldots, N\}$
with the worst, i.e. the minimum residual margin (overlap) $v_t^T z_\nu$ is chosen.
Hence, the name MinOver.

Compared to [5] we present a simplified formulation of the MinOver algorithm,
with the number of desired iterations $t_{\max}$ prespecified:

0. Set $t = 0$, choose a $t_{\max}$, and set $v_{t=0} = 0$.

1. Determine the $z_{\alpha(t)}$ out of the training set $T$ for which $v_t^T z$ is minimal.

2. Set $v_{t+1} = v_t + z_{\alpha(t)}$.

3. Set $t = t + 1$ and go to 1. if $t < t_{\max}$.
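The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `minover`, the tie-breaking rule (the first pattern with minimal overlap is taken) and the tiny example are all hypothetical.

```python
def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def minover(Z, t_max):
    """Run t_max MinOver steps on the augmented patterns Z and return v."""
    v = [0.0] * len(Z[0])                          # step 0: v_0 = 0
    for _ in range(t_max):                         # step 3: stop after t_max steps
        z_min = min(Z, key=lambda z: dot(v, z))    # step 1: minimum overlap (first one on ties)
        v = [vi + zi for vi, zi in zip(v, z_min)]  # step 2: v_{t+1} = v_t + z_alpha(t)
    return v

# Hypothetical 1-D example: x = +1 (class +1) and x = -1 (class -1),
# augmented as z = y * (x, -1).
Z = [[1.0, -1.0], [1.0, 1.0]]
v = minover(Z, 10)
```

In this example the two patterns are selected alternately, so after an even number of steps $v$ points exactly along $(1, 0)$, which is the maximum margin direction for these two augmented patterns.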

2.1 MinOver in Its Dual Formulation and with Kernels 

The vector $v_t$ which determines the dividing hyperplane is given by

$$v_t = \sum_{\tau=0}^{t-1} z_{\alpha(\tau)} = \sum_{z_\nu \in \mathcal{V}_t} n_\nu(t)\, z_\nu$$

with $\mathcal{V}_t \subseteq T$ as the set of all patterns which have been used for training so far. The
coefficient $n_\nu(t) \in \mathbb{N}$ denotes the number of times $z_\nu \in \mathcal{V}_t$ has been used for training
up to time step $t$. $\sum_{z_\nu \in \mathcal{V}_t} n_\nu(t) = t$ is valid. With $V_t = |\mathcal{V}_t| \le t$ we denote the number of
training patterns which determine the normal vector $v_t$.

In the dual representation the expression which decides the class assignment by
being smaller or larger than zero can be written as

$$v^T z = y \left( \sum_{x_\nu \in \mathcal{V}} n_\nu y_\nu (x_\nu^T x) - b \right) \tag{1}$$

with

$$b = -\sum_{x_\nu \in \mathcal{V}} n_\nu y_\nu . \tag{2}$$

In the dual formulation the training of the MinOver algorithm consists of either adding
the training pattern $z_\alpha$ to $\mathcal{V}$ as a further data point or, if $z_\alpha$ is already an element of $\mathcal{V}$,
increasing the corresponding $n_\alpha$ by one.

If the input patterns $x \in \mathbb{R}^D$ are transformed into another (usually higher dimensional)
feature space $\Phi(x) \in \mathbb{R}^{D'}$ before classification, MinOver has to work with
$z_\nu = y_\nu (\Phi(x_\nu), -1)^T$. Due to Equation (1) it does not have to do it explicitly. With
$K(x_\nu, x) = \Phi^T(x_\nu) \Phi(x)$ as the kernel which corresponds to the transformation $\Phi(x)$,
we obtain

$$v^T z = y \left( \sum_{x_\nu \in \mathcal{V}} n_\nu y_\nu K(x_\nu, x) - b \right), \tag{3}$$

with the $b$ of Equation (2).

In its dual formulation the MinOver algorithm is simply an easy procedure of selecting
data points out of the training set:

0. Set $t = 0$, choose a $t_{\max}$, and set $\mathcal{V} = \emptyset$.

1. Determine the $z_{\alpha(t)}$ out of the training set $T$ for which $v_t^T z$ according to Equation (3)
is minimal.

2. If $z_{\alpha(t)} \notin \mathcal{V}$, add $z_{\alpha(t)}$ to $\mathcal{V}$ and assign to it an $n_\alpha = 1$. If $z_{\alpha(t)} \in \mathcal{V}$ already,
increase its $n_\alpha$ by one.

3. Set $t = t + 1$ and go to 1. if $t < t_{\max}$.
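The dual procedure can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `kernel_minover`, `decision_value` and `linear_kernel` are hypothetical, the set of selected patterns is represented implicitly by the counters `n[i] > 0`, the offset is taken as $b = -\sum_\nu n_\nu y_\nu$ (which follows from $z = y\,(x, -1)$ and $v = (w, b)$), and the full Gram matrix is precomputed, which is only reasonable for small training sets.

```python
def kernel_minover(X, Y, kernel, t_max):
    """Dual MinOver: return the selection counts n (n[i] > 0 marks selected patterns)."""
    N = len(X)
    n = [0] * N
    # z_j^T z_i = y_j * y_i * (K(x_j, x_i) + 1): the "+ 1" carries the offset part.
    G = [[kernel(X[j], X[i]) + 1.0 for i in range(N)] for j in range(N)]
    overlap = [0.0] * N                      # v_t^T z_i for all i, starting from v_0 = 0
    for _ in range(t_max):
        a = overlap.index(min(overlap))      # pattern with minimum overlap (first on ties)
        n[a] += 1                            # add it to the selected set or increase its counter
        for i in range(N):                   # v_{t+1} = v_t + z_a, expressed via overlaps
            overlap[i] += Y[i] * Y[a] * G[a][i]
    return n

def decision_value(x, X, Y, n, kernel):
    """sum_nu n_nu y_nu K(x_nu, x) - b; its sign gives the predicted class."""
    b = -sum(nj * yj for nj, yj in zip(n, Y))
    s = sum(nj * yj * kernel(xj, x) for nj, yj, xj in zip(n, Y, X) if nj > 0)
    return s - b

def linear_kernel(a, b):
    """Plain inner product, used here only to keep the example checkable by hand."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical 1-D example with one pattern per class.
X = [[1.0], [-1.0]]
Y = [1, -1]
n = kernel_minover(X, Y, linear_kernel, 10)
```

Keeping the overlaps of all training patterns up to date costs $N$ kernel evaluations (or Gram-matrix lookups) per step, since only the contribution of the newly selected pattern has to be added.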

3 Convergence Bounds for MinOver 

Krauth and Mezard gave a convergence proof for MinOver [5]. Within the context of
spin-glass Neural Networks they showed that the smallest margin $d_t = v_t^T z_{\alpha(t)} / \|v_t\|$
provided by $v_t$ converges to the maximum margin $d_*$ at least as $O(t^{-1/2})$. We give
a modified proof of this $O(t^{-1/2})$ convergence. Based on this proof we show that the
margin converges even at least as $O(t^{-1})$ to the maximum margin.

3.1 $O(t^{-1/2})$ Bound

We look at the convergence of the angle $\gamma_t$ between $v_t$ and $v_*$. We decompose the
learning vector $v_t$ into

$$v_t = \cos\gamma_t \, \|v_t\| \, v_* + u_t, \qquad u_t^T v_* = 0. \tag{4}$$

$\|u_t\| \le R\sqrt{t}$ is valid, with $R$ as the norm of the augmented pattern with maximum
length, i.e., $R = \max_\nu \|z_\nu\|$. This can be seen from

$$
\begin{aligned}
u_{t+1}^2 - u_t^2 &= \left( u_t + z_{\alpha(t)} - [z_{\alpha(t)}^T v_*]\, v_* \right)^2 - u_t^2 \\
&= 2\, u_t^T z_{\alpha(t)} + z_{\alpha(t)}^2 - [z_{\alpha(t)}^T v_*]^2 \\
&\le R^2 .
\end{aligned}
\tag{5}
$$


We have used $u_t^T z_{\alpha(t)} \le 0$. Otherwise the condition




would be violated. Since also 



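$$v_*^T v_t = \sum_{\tau=0}^{t-1} v_*^T z_{\alpha(\tau)} \ge d_* t$$

is valid, we obtain the bounds

$$\sin\gamma_t \le \gamma_t \le \tan\gamma_t = \frac{\|u_t\|}{v_*^T v_t} \le \frac{R\sqrt{t}}{d_* t} = \frac{R/d_*}{\sqrt{t}} . \tag{6}$$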



The angle $\gamma_t$ between the hyperplane provided by MinOver and the maximum margin
hyperplane converges to zero at least as $O(t^{-1/2})$.
After a finite number of training steps the $z_{\alpha(t)}$ selected for learning will always be
support vectors with $d_* = v_*^T z_{\alpha(t)}$. This can be seen from the following arguments:
with Equation (4) we obtain

$$d_* \ge \frac{v_t^T z_{\alpha(t)}}{\|v_t\|} = v_*^T z_{\alpha(t)} \cos\gamma_t + \frac{u_t^T z_{\alpha(t)}}{\|u_t\|} \sin\gamma_t . \tag{7}$$

If $z_{\alpha(t)}$ is not a support vector, $v_*^T z_{\alpha(t)} > d_*$ is valid. Since the prefactor of the sine
is bounded and $\gamma_t$ converges to zero, after a finite number of training steps the right hand
side would be larger than $d_*$. Hence, after a finite number of learning steps the $z_{\alpha(t)}$ can only be support
vectors.

Equation (7) yields the convergence of $d_t$. We obtain

$$d_* \ge d_t \ge d_* \cos\gamma_t - R \sin\gamma_t \ge d_* (1 - \gamma_t^2 / 2) - R \gamma_t .$$

With the term leading in $\gamma_t$ and with our upper bound for $\gamma_t$, the convergence of the
margin with increasing $t$ is bounded by

$$0 \le \frac{d_* - d_t}{d_*} \le \frac{R}{d_*} \gamma_t \le \frac{R^2 / d_*^2}{\sqrt{t}} .$$



3.2 $O(t^{-1})$ Bound

From Equation (6) we can discern that we obtain an $O(t^{-1})$ bound for the angle $\gamma_t$ and,
hence, an $O(t^{-1})$ convergence of the margin $d_t$ to the maximum margin $d_*$, if $\|u_t\|$
remains bounded. This is indeed the case:

We introduced $u_t$ as the projection of $v_t$ onto the maximum margin hyperplane
given by the normal vector $v_*$. In addition we introduce $s_\nu = z_\nu - (v_*^T z_\nu) v_*$ as the
projection of the training patterns $z_\nu$ onto the maximum margin hyperplane given by
$v_*$. As we have seen above, after a finite $t = t_{start}$, each $s_{\alpha(t)}$ corresponds to one of the
$N_S$ support vectors. Then looking for the $z_{\alpha(t)}$ out of the training set $T$ for which $v_t^T z$
is minimal becomes equivalent to looking for the $s_{\alpha(t)}$ out of the set $S'$ of projected
support vectors for which $u_t^T z = u_t^T s$ is minimal.

We now go one step further and introduce $u'_t$ as the projection of $u_t$ onto the
subspace spanned by the $s_\nu \in S'$. This subspace is at most $N_S$-dimensional. Since
$u_t^T s_\nu = {u'_t}^T s_\nu$ for the $s_\nu \in S'$, we now look for the $s_{\alpha(t)} \in S'$ for which ${u'_t}^T s$ is
minimal. Note that for $t \ge t_{start}$ always $u_t^T z_{\alpha(t)} = u_t^T s_{\alpha(t)} = {u'_t}^T s_{\alpha(t)} \le 0$ is valid.

The following analysis of the development of $u$ over time starts with $u_{t_{start}}$. We
have

$$u_t = u_{t_{start}} + \sum_{\tau = t_{start}}^{t-1} s_{\alpha(\tau)} .$$

$u_t$ remains bounded, if $u'_t$ remains bounded. We discriminate the following three cases:

i) $\max_{\|u'\|=1} \min_{s_\nu \in S'} ({u'}^T s_\nu) < 0$

ii) $\max_{\|u'\|=1} \min_{s_\nu \in S',\, \|s_\nu\| > 0} ({u'}^T s_\nu) > 0$

iii) $\max_{\|u'\|=1} \min_{s_\nu \in S',\, \|s_\nu\| > 0} ({u'}^T s_\nu) = 0$
Note that the vector $u'$ with $\|u'\| = 1$ varies only within the subspace spanned by the
$s_\nu \in S'$. If this subspace is of dimension one, only i) or ii) can occur. For i) and ii) it
can quickly be proven that $u'_t$ remains bounded. Case iii) can be redirected to i) or ii),
which is a little bit more tedious.

i) There is an $\epsilon > 0$ such that for each training step ${u'_t}^T s_{\alpha(t)} \le -\epsilon \|u'_t\|$. Analogous
to Equation (5) we obtain

$$\Delta {u'_t}^2 = 2\, {u'_t}^T s_{\alpha(t)} + s_{\alpha(t)}^2 \le -2 \epsilon \|u'_t\| + R^2 .$$

The negative contribution to the change of $\|u'_t\|$ with each training step increases with
$\|u'_t\|$ and keeps it bounded.
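One way to spell out why the inequality of case i) keeps $\|u'_t\|$ bounded is the following short supplementary argument (a sketch added for illustration, using only $\|u'_{t+1}\|^2 - \|u'_t\|^2 \le -2\epsilon\|u'_t\| + R^2$ and the fact that the cross term is never positive):

$$
\|u'_{t+1}\|^2 \le \|u'_t\|^2 - 2\epsilon \|u'_t\| + R^2
\;\;\Longrightarrow\;\;
\begin{cases}
\|u'_{t+1}\|^2 < \|u'_t\|^2 & \text{if } \|u'_t\| > R^2 / (2\epsilon), \\
\|u'_{t+1}\|^2 \le \left( R^2 / (2\epsilon) \right)^2 + R^2 & \text{if } \|u'_t\| \le R^2 / (2\epsilon).
\end{cases}
$$

By induction this gives $\|u'_t\|^2 \le \max\left\{ \|u'_{t_{start}}\|^2,\; (R^2/(2\epsilon))^2 + R^2 \right\}$ for all $t \ge t_{start}$.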

ii) There is a $u'$ such that ${u'}^T s_\nu > 0$ for each $\|s_\nu\| > 0$. In this case there is an
$s_\nu \in S'$ with $\|s_\nu\| = 0$, since always ${u'_t}^T s_{\alpha(t)} \le 0$ has to be valid. If $s_{\alpha(t)} = 0$, the
change of the vector $u'_t$ terminates, since also $s_{\alpha(t+1)}$ will be zero. Will $s_{\alpha(t)}$ be
zero after a finite number of training steps? It will, since there is a $u'$ which separates
the $\|s_\nu\| > 0$ from $s_\nu = 0$ with a positive margin. We know from perceptron learning
that in this case also $u'_t$ will separate these $\|s_\nu\| > 0$ after a finite number of learning
steps. At the latest when this is the case $s_{\alpha(t)}$ will be zero and $\|u'_t\|$ will stay bounded.

iii) We will redirect this case to i) or ii). With $u'_*$ we denote the $u'$, $\|u'\| = 1$, which
maximizes $\min_{s_\nu \in S',\, \|s_\nu\| > 0} ({u'}^T s_\nu)$. The set of those $s_\nu \in S'$ with ${u'_*}^T s_\nu = 0$ we
denote by $S''$. The $s_\nu \in S' \setminus S''$ are separated from the origin by a positive margin. After
a finite number of learning steps $s_{\alpha(t)}$ will always be an element of $S''$. Then looking
for the $s_{\alpha(t)}$ out of $S'$ for which ${u'_t}^T s$ is minimal becomes equivalent to looking for the
$s_{\alpha(t)}$ out of the set $S''$ for which ${u''_t}^T s$ is minimal, with $u''_t$ as the projection of $u'_t$
onto the subspace spanned by the $s_\nu \in S''$. Note that the dimension of this subspace is
reduced by at least one compared to the subspace spanned by the $s_\nu \in S'$. For $s_{\alpha(t)} \in S''$
again $u_t^T z_{\alpha(t)} = u_t^T s_{\alpha(t)} = {u'_t}^T s_{\alpha(t)} = {u''_t}^T s_{\alpha(t)} \le 0$ is valid. $u'$ remains bounded,
if $u''$ remains bounded. We have the same problem as in the beginning, but within a
reduced subspace. Either case i), ii), or iii) applies. If again case iii) applies, it will
again lead to the same problem, but within a subspace reduced even further. After a
finite number of these iterations the dimension of the respective subspace will be one.
Then only case i) or ii) can apply and, hence, $\|u\|$ will stay bounded.

It is possible to show that the $O(t^{-1})$ convergence bound for $\tan\gamma_t$ is a tight bound.
Due to the limited space we have to present the proof in a follow-up paper.



4 Computer Experiments 

To illustrate these bounds with computer experiments, we measured the convergence
of $\tan\gamma_t$ on two artificial data sets. Both data sets consisted of $N = 1000$ patterns
$x_\nu \in \mathbb{R}^D$, half of them belonging to class $+1$ and the other half to class $-1$. The pattern space
was two-dimensional ($D = 2$) for the first data set, and 100-dimensional ($D = 100$)
for the second one.

Each data set was generated as follows: a random normal vector for the maximum
margin hyperplane was chosen. On a hypersquare on this hyperplane with a sidelength
of 2 the $N = 1000$ random input patterns were generated. Then half of them were
shifted to one halfspace (class $+1$) by a random amount uniformly chosen from the
interval $[0.1, 1]$, and the other half was shifted to the other halfspace (class $-1$) by
a random amount uniformly chosen from $[-1, -0.1]$. To make sure that the chosen
normal vector indeed defines the maximum margin hyperplane, for 30% of the patterns
a margin of exactly 0.1 was chosen.
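A data set of this kind could be generated roughly as in the following Python sketch. Several details are assumptions, since the text does not specify them: the hyperplane is taken to pass through the origin, the points on the hyperplane are obtained by projecting uniform samples from the cube $[-1, 1]^D$ (which only approximates a hypersquare of sidelength 2), the class labels alternate, and the first 30% of the patterns receive the margin of exactly 0.1. The function name `make_dataset` and its parameters are hypothetical.

```python
import math
import random

def make_dataset(n_patterns=1000, dim=2, min_shift=0.1, max_shift=1.0,
                 frac_on_margin=0.3, seed=0):
    """Return patterns X, labels Y and the unit normal of the generating hyperplane."""
    rng = random.Random(seed)
    # Random unit normal vector of the hyperplane (assumed to pass through the origin).
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    length = math.sqrt(sum(c * c for c in normal))
    normal = [c / length for c in normal]

    n_on_margin = round(frac_on_margin * n_patterns)
    X, Y = [], []
    for i in range(n_patterns):
        # Uniform point in the cube [-1, 1]^dim, projected onto the hyperplane.
        p = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        along = sum(pi * ni for pi, ni in zip(p, normal))
        p = [pi - along * ni for pi, ni in zip(p, normal)]
        y = 1 if i % 2 == 0 else -1          # alternating labels: half per class
        # The first 30% sit exactly on the margin, the rest are shifted further away.
        shift = min_shift if i < n_on_margin else rng.uniform(min_shift, max_shift)
        X.append([pi + y * shift * ni for pi, ni in zip(p, normal)])
        Y.append(y)
    return X, Y, normal

X, Y, normal = make_dataset(n_patterns=1000, dim=2)
# Signed distance of every pattern from the generating hyperplane.
distances = [y * sum(xi * ni for xi, ni in zip(x, normal)) for x, y in zip(X, Y)]
```

The list `distances` can be used to check that no pattern lies closer than 0.1 to the generating hyperplane.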





Fig. 1. Double-logarithmic plot of the angle $\gamma_t$ between the maximum margin hyperplane and the
hyperplane provided by MinOver against the number of learning steps $t$. After a finite number of
learning steps the plot follows a line of slope $-1$, which demonstrates the $O(t^{-1})$ convergence.
For comparison the old $O(t^{-1/2})$ convergence bound is shown. At the end of the learning procedure
$\tan\gamma_t$ is about three orders of magnitude smaller than predicted by the old $O(t^{-1/2})$-bound.



After each training step we calculated $\tan\gamma_t$ of the angle $\gamma_t$ between the known
maximum margin hyperplane and the hyperplane defined by $v_t$. The result for both
data sets is shown in Fig. 1. To visualize the convergence rate we chose a double logarithmic
plot. As expected, in this double logarithmic plot convergence is bounded by a
line with a slope of $-1$, which corresponds to the $O(t^{-1})$ convergence we have proven.
For comparison we also plotted the $O(t^{-1/2})$-bound given by Equation (6), which corresponds
to a line of slope $-1/2$. After 100,000 training steps $\tan\gamma_t$ is about three
orders of magnitude smaller than predicted by the old $O(t^{-1/2})$-bound.
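The quantity measured in this experiment, $\tan\gamma_t = \|u_t\| / (v_*^T v_t)$ from Equation (6), can be computed directly from the decomposition of Equation (4). The following Python sketch assumes that the reference direction is given as a unit vector; the function name `tan_angle` and the example vectors are hypothetical.

```python
import math

def tan_angle(v, v_star):
    """tan(gamma) between v and the unit vector v_star, using v = (v_star^T v) v_star + u."""
    along = sum(vi * si for vi, si in zip(v, v_star))       # v_star^T v
    u = [vi - along * si for vi, si in zip(v, v_star)]      # component orthogonal to v_star
    return math.sqrt(sum(ui * ui for ui in u)) / along

# A vector at 45 degrees to the reference direction (1, 0).
example = tan_angle([3.0, 3.0], [1.0, 0.0])
```

Evaluating this function after each training step and plotting the values against $t$ on doubly logarithmic axes gives a curve of the kind described for Fig. 1.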

5 Conclusions 

The well-known MinOver algorithm as a simple and iterative procedure for obtaining
maximum margin hyperplanes has been reformulated for the purpose of support vector
classification with and without kernels. We have given an alternative proof for its
well-known $O(t^{-1/2})$ convergence. Based on this proof we have shown that the MinOver
algorithm converges even at least as $O(t^{-1})$ with increasing number of learning
steps. We illustrated this result on two artificial data sets. With such a guarantee in convergence
speed, with its simplicity, and with a computational effort which scales like
$O(N)$ with the number of training patterns the MinOver algorithm deserves a more
widespread consideration in applications.

Acknowledgment. The author would like to thank Kai Labusch for his help preparing 
the manuscript. 
Abstract. We propose an approach to categorize real-world natural scenes based
on a semantic typicality measure. The proposed typicality measure allows grading
the similarity of an image with respect to a scene category. We argue that such
a graded decision is appropriate and justified both from a human's perspective
and from the image-content point of view. The method combines bottom-up
information of local semantic concepts with the typical semantic content of an
image category. Using this learned category representation, the proposed typicality
measure also quantifies the semantic transitions between image categories such as
coasts, rivers/lakes, forests, plains, mountains or sky/clouds. The method
is evaluated quantitatively and qualitatively on a database of natural scenes.
The experiments show that the typicality measure represents well the diversity of
the given image categories as well as the ambiguity in human judgment of image
categorization.



1 Introduction 

Scene categorization or scene classification is still a challenge on the way to reduce 
the semantic gap between “the information that one can extract from visual data and 
the users’ interpretation for the same data in a given situation” [1]. In the context of
this paper, scene categorization refers to the task of grouping images into semantically 
meaningful categories. But what are “semantically meaningful” categories? In image 
retrieval, meaningful categories correspond to those basic-level image categories that 
act as a starting point when users describe verbally the particular image they are
searching for. In general however, any natural scene category will be characterized by 
a high degree of diversity and potential ambiguities. The reason is that those categories 
depend strongly on the subjective perception of the viewer. 

We argue that high categorization accuracies should not be the primary evaluation 
criterion for categorization. Since many natural scenes are in fact ambiguous, the catego- 
rization accuracy only reflects the accuracy with respect to the opinion of the particular 
person that performed the annotation. Therefore, the attention should also be directed 
at modeling the typicality of a particular scene. Here, typicality can be seen as a mea- 
sure for the uncertainty of annotation judgment. Research in psychophysics especially 
addresses the concept of typicality in categorization. In each category, typical and less 
typical items can be found with typicality differences being the most reliable effect in 
categorization research [2]. 
Fig. 1. Images of each category (columns: coasts, rivers/lakes, forests, plains, mountains, sky/clouds). Top three rows: typical images. Bottom row: less typical image.



We propose a semantic typicality measure that grades the similarity of natural real-world
images with respect to six scene categories. Furthermore, the typicality measure
allows categorizing the images into one of those categories. Images are represented
through the frequency of occurrence of nine local semantic concepts. Based on this 
information, a prototypical category representation is learned for each scene category. 
The proposed typicality measure is evaluated both qualitatively and quantitatively using 
cross-validation on an image database of 700 natural scenes. 

Previous research in scene classification usually aims for high classification ac- 
curacies by using very “clean” databases. Early research covers city/landscape- [3], 
indoor/outdoor- [4] and indoor/outdoor-, city/landscape-, sunset/mountain/forest-classi- 
fication [5]. These approaches employ only global image information rather than local- 
ized information. The goal of more recent work is to automatically annotate local image 
regions [6]-[9], but the majority does not try to globally describe the retrieved images. 
Oliva and Torralba [10] attach global labels to images based on local and global features, 
but do not use any intermediate semantic annotation. 

In the next section, we discuss the selection of image categories. The image and 
category representations are introduced in Sections 3 and 4. We present our typicality
measure and the categorization based on it in Section 5. Section 6 summarizes the 
categorization results using automatically classified image subregions as input. Finally, 
Section 7 shows the categorization performance visually on new, unseen images. 



2 Basic Level Image Categories 

Our selection of scene categories has been inspired by work in psychophysics. In search 
of a taxonomy of environmental scenes, Tversky and Hemenway [11] found indoors 
and outdoors to be superordinate-level categories, with the outdoors category be- 
ing composed of the basic-level categories city, park, beach and mountains. The 
experiments of Rogowitz et al. [12] revealed two main axes in which humans sort 
photographic images: human vs. non-human and natural vs. artificial. For our experiments,
we selected the non-human/natural coordinate as superordinate and extended the
natural, basic-level categories of [11] to coasts, rivers/lakes, forests, plains,
mountains and sky/clouds. The diversity of those categories is illustrated in Figure 1.
It displays a sample of images for each category. The top three rows correspond to
typical examples for each category. The bottom row shows images which are far less
typical but which are, arguably, still part of the respective category. Obviously, those
examples are more difficult to classify and literally correspond to borderline cases. In the
following, we aim for a semantic typicality measure based on the global composition of
local semantic concepts which reflects that those less typical images in the bottom part
of Figure 1 are less similar to the semantic category than the more typical images in
the upper part of the figure.



3 Image Representation 

Many studies have shown that in categorization, members and non-members form a 
continuum with no obvious break in people’s membership judgment. Quite importantly, 
typicality differences are probably the strongest and most reliable effect in the categorization
literature [2]. For example, it has been found that typical items were more likely
to serve as cognitive reference points [13] and that learning of category representations 
is faster if subjects are taught on mostly typical items than if they are taught on less 
typical items [14]. In our opinion, any successful category representation has to take 
these findings into account and should be consistent with them. 

A representation which is predictive of typicality is the so-called "attribute score" 
[15]. That is, items that are most typical have attributes that are very common in the 
category. In this approach each attribute is weighted in order to take into account its
respective importance for the category. In our case, it is the local semantic concepts that
act as scene category attributes. By analyzing the semantic similarities and dissimilari- 
ties of the aforementioned categories, the following set of nine local semantic concepts 
emerged as being most discriminant: sky, water, grass, trunks, foliage, field, rocks, flow- 
ers and sand. In our current implementation, the local semantic concepts are extracted 
on an arbitrary regular 10x10 grid of image subregions. For each local semantic concept, 
its frequency of occurrence in a particular image is determined and each image is rep- 
resented by a so-called concept occurrence vector. Figure 2 shows an exemplary image 
with its local semantic concepts and its concept occurrence vector. Since the statistics 
of the local semantic concepts vary significantly when analyzing certain image areas 
separately (e.g. top/middle/bottom), we evaluate the concept occurrence vector for at 
least three image areas. 
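The concept occurrence vector described above can be sketched in Python as follows. The sketch assumes that each image is given as a grid of concept labels (one label per subregion) and that the image areas are horizontal bands of rows; how the ten rows are divided among three bands is an assumption, as are the names `concept_occurrence_vector` and `area_vectors`.

```python
CONCEPTS = ["sky", "water", "grass", "trunks", "foliage",
            "field", "rocks", "flowers", "sand"]

def concept_occurrence_vector(rows):
    """Relative frequency of each concept among the subregions in `rows`."""
    cells = [label for row in rows for label in row]
    return [cells.count(c) / len(cells) for c in CONCEPTS]

def area_vectors(grid, n_areas=3):
    """Concept occurrence vectors for n_areas horizontal bands, top to bottom."""
    n_rows = len(grid)
    # Band boundaries in rows; for 10 rows and 3 bands this gives 0, 3, 7, 10.
    bounds = [round(k * n_rows / n_areas) for k in range(n_areas + 1)]
    return [concept_occurrence_vector(grid[bounds[k]:bounds[k + 1]])
            for k in range(n_areas)]

# Hypothetical 10x10 annotation: 5 rows of sky above 5 rows of water.
grid = [["sky"] * 10 for _ in range(5)] + [["water"] * 10 for _ in range(5)]
whole = concept_occurrence_vector(grid)   # sky and water each occupy half of the image
bands = area_vectors(grid)                # top band: only sky, bottom band: only water
```

The per-area vectors make the spatial layout visible that the whole-image vector hides: here the whole image is half sky and half water, whereas the top and bottom bands are each dominated by a single concept.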

Database. The database consists of 700 images in the categories coasts, forests, 
rivers/lakes, plains, mountains and sky/clouds. All image subregions have 
been manually annotated with the above mentioned nine local semantic concepts. 



4 Prototypical Representation of Scene Categories 

Fig. 2. Image representation. Concept occurrences of the example image: sky 21.5%, water 32.0%, grass 0.0%, trunks 0.0%, foliage 31.0%, fields 0.0%, rocks 15.5%, flowers 0.0%, sand 0.0%.

Fig. 3. Prototypes and standard deviations of the scene categories (panel title: prototypes of coasts (solid) and forests (dashed)).

The representation of the scene categories should take into account the typicality effect
and the prototype phenomenon. A category prototype is an example which is most
typical for the category, even though the prototype is not necessarily an existing category
member. Given the image representation presented in the previous section, a
prototypical representation for the six scene categories can be learned. This prototypical
representation allows grading the different members of the category by an appropriate
semantic typicality measure. The measure takes into account the occurrence statistics of
the semantic concepts and weights them according to their variance within the category.
The prototypical representation corresponds to the means over the concept occurrence
vectors of all category members. Figure 3 displays those prototypes and the standard
deviations for all categories. From this figure, it becomes apparent which local semantic
concepts are especially discriminant. For example, forests are characterized through
a large amount of foliage and trunks, whereas mountains can be differentiated when a
large amount of rocks is detected.
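A category prototype of this kind, the per-concept mean together with the per-concept standard deviation, could be computed as in the following Python sketch. The function name `prototype`, the use of the population standard deviation (division by the number of members) and the example vectors are assumptions made for illustration.

```python
import math

def prototype(vectors):
    """Per-concept mean and standard deviation over a category's occurrence vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    std = [math.sqrt(sum((v[k] - mean[k]) ** 2 for v in vectors) / n)
           for k in range(dim)]
    return mean, std

# Hypothetical occurrence vectors (sky, water, foliage) of two images of one category.
members = [[0.25, 0.75, 0.0], [0.75, 0.25, 0.0]]
mean, std = prototype(members)
```

A concept with a small standard deviation within a category is a reliable indicator for that category, which is why the typicality measure weights the concepts by their variance.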

5 Typicality and Categorization 

The proposed category representation has the advantage of not representing binary decisions
about the semantic concepts being present in the image or not (“Yes, there are
rocks.” vs. “No, there are no rocks.”). Instead it represents soft decisions about the degree
to which a particular semantic concept is present. The distances of the category
members to the prototypical representation thus allow assessing the typicality of these
images without excluding them from the category. There might be mountains scenes
that hardly contain any rocks, but quite some foliage. They do belong to the mountains
category, but they are much less typical than mountains scenes that contain a larger
amount of rocks. In fact, they might be quite close to the borderline of being forest
scenes.

The image typicality is measured by computing the Mahalanobis distance between
the image’s concept occurrence vector and the prototypical representation. All experiments
have been 10-fold cross-validated. Hence, the category prototypes are computed
on 90% of the database. All images depicted in the following belong to the respective test sets.
Figures 4, 5, and 6 show the transitions between two categories with the typicality distance
measure printed below the images, normalized to the range [0, 1]. A value close to
0 corresponds to a close similarity of the particular image to the first of the two categories
and vice versa. For example, Figure 4 clearly shows the increase in “forest-ness” from
left to right. Figure 5 depicts the transition from forests to mountains and Figure 6
the transition from mountains back to rivers/lakes.

Fig. 4. Transition from rivers/lakes to forests with normalized typicality value (D = 0.06, 0.29, 0.34, 0.81, 0.83, 0.95)

Fig. 5. Transition from forests to mountains with normalized typicality value (D = 0.05, 0.1?, 0.39, 0.48, 0.62, 0.87)

Fig. 6. Transition from mountains to rivers/lakes with normalized typicality value (D = 0.11, 0.40, 0.49, 0.67, 0.77, 0.82)

With the typicality measure, the categorization of unseen images can also be carried
out. For a new image, the similarity to the prototypical representation of each category is
computed and the image is assigned to the category with the smallest distance. Table 1(a)
shows the confusion matrix for the categorization of the annotated database images (10-fold
cross-validated) resulting in an overall categorization rate of 89.3%. The analysis
of the mis-categorized images shows that most of the confusions can be explained by
similarities between the different categories. Another way to evaluate the performance is to
use the rank statistics of the categorization shown in Table 1(b). Using both the best and
the second best match, the categorization rate rises to 98.0%. This indicates that images
which are incorrectly categorized with the first match are on the borderline between two similar
categories and are therefore most often correctly categorized with the second best match. It
is also true that the typicality values of those two matches are often very close to each
other.
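The categorization by smallest distance and the rank statistics can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the Mahalanobis distance is computed with a diagonal covariance (per-concept variances only), a small constant `eps` guards against zero variances, and the prototypes, the example image and all function names are hypothetical.

```python
import math

def mahalanobis_diag(x, mean, std, eps=1e-6):
    """Mahalanobis distance assuming a diagonal covariance (per-concept variances)."""
    return math.sqrt(sum(((xi - mi) / (si + eps)) ** 2
                         for xi, mi, si in zip(x, mean, std)))

def rank_categories(x, prototypes):
    """Category names sorted from most to least typical (smallest distance first)."""
    return sorted(prototypes, key=lambda name: mahalanobis_diag(x, *prototypes[name]))

def rank_n_rate(samples, prototypes, n=1):
    """Fraction of (vector, true_label) samples whose label is among the top-n matches."""
    hits = sum(label in rank_categories(x, prototypes)[:n] for x, label in samples)
    return hits / len(samples)

# Hypothetical prototypes: name -> (mean, std) over (sky, water, foliage, rocks).
prototypes = {
    "coasts":    ([0.4, 0.5, 0.05, 0.05], [0.1, 0.1, 0.05, 0.05]),
    "forests":   ([0.1, 0.0, 0.8, 0.1],   [0.1, 0.05, 0.1, 0.1]),
    "mountains": ([0.3, 0.0, 0.2, 0.5],   [0.1, 0.05, 0.1, 0.1]),
}
# A mountains image with much foliage and few rocks: a borderline case.
image = [0.2, 0.0, 0.6, 0.2]
ranking = rank_categories(image, prototypes)
```

In this toy example the image labeled mountains is closest to the forests prototype and second closest to the mountains prototype, so it counts as an error at rank one but as correct once the second best match is included, which is the effect the rank statistics are meant to capture.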

6 Categorization of Classified Images 

The categorization experiment of the previous section was carried out using the manually
annotated images of our database. In this section, we discuss the categorization results
when the 70,000 image subregions have automatically been classified into one of the
semantic concept classes. The subregions are represented by a combined 84-bin linear
HSI color histogram and a 72-bin edge direction histogram, and classified by a k-Nearest
Neighbor classifier. Features and classifier have been selected through an extensive series
of experiments. These pre-tests also showed that the use of neighborhood information in
fact decreases the overall classification accuracy since it penalizes concepts that appear
as “singularities” in the image instead of as contiguous regions (e.g. trunks or grass).
The classification accuracy of the concept classification is 68.9%.

Table 1. Categorization Confusion Matrix and Rank Statistics - Annotated Image Regions: (a) Confusion Matrix, (b) Rank Statistics

Table 2. Categorization Confusion Matrix and Rank Statistics - Classified Image Regions: (a) Confusion Matrix, (b) Rank Statistics

The prototypical representation of the categories is computed on ten image areas (ten 
rows from top to bottom, see Section 3). Both concept classification and categorization 
are 10-fold cross-validated on the same test and training set. That is, a particular training 
set is used to train the concept classifier and to learn the prototypes. The images of 
the corresponding test set are classified locally with the learned concept classifier and 
subsequently categorized. 

The overall categorization rate of the classified images is 67.2%. The corresponding
confusion matrix is displayed in Table 2(a). A closer analysis of the confusions leads
to the following insights. Good and less good categorization is strongly correlated with
the performance of the concept classifier that is most discriminant for the particular
category. Three of the six categories have been categorized with high accuracy: forests,
mountains and sky/clouds. The reason is that the important local concepts for those
categories, that is sky, foliage and rocks, have been classified with high accuracy and thus
lead to a better categorization. Critical for the categorization especially of the category
plains is the classification of fields. Since fields is frequently confused with either
foliage or rocks, plains is sometimes mis-categorized as forests or mountains.
Fig. 7. Examples for “correctly” categorized images (mountains, rivers/lakes, plains).

Fig. 8. Examples for “incorrectly” categorized images: forest (instead of rivers/lakes), mountains (instead of forest), coasts (instead of rivers/lakes).



Another semantic concept that is critical for the categorization is water. If not enough 
water is classified correctly, rivers/lakes images are confused with forests or 
mountains depending on the amount of foliage and rocks in the image. If too much 
water has incorrectly been detected in rivers/lakes images, they are confused with 
coasts. 

Table 2(b) displays the rank statistics for the categorization problem. When using
both the best and the second best match, the categorization rate is 83.1%. As before with 
the labeled data, there is a large jump in categorization accuracy from the first to the 
second rank. This leads to the conclusion that the wrong classifications on subregion 
level move many images closer to the borderline between two categories and thus cause 
mis-categorizations. 



7 More Categorization Results 

In order to verify the results of Section 6, both concept classification and scene catego- 
rization were tested on new images that do not belong to the cross-validated data sets. 
Exemplary categorization results are displayed in Figure 7 and Figure 8. “Correctly” and 
“incorrectly” are quoted on purpose since especially Figure 8 exemplifies how difficult it 
is to label the respective images. When does a forest-scene become a rivers/lakes-scene
or a mountains-scene a forest-scene? The reason for the “mis-categorization”
of the first image in Figure 8 is that a large amount of water has been classified as foliage,
thus moving the scene closer to the forest-prototype. The reason for the other two
“mis-categorizations” is the ambiguity of the scenes. For all three images in Figure 8, the
typicality measure returned low confidence values for either of the two relevant
categories, whereas the typicality value for the scenes in Figure 7 is higher. This shows
that we are able to detect difficult or ambiguous scenes using our image representation
in combination with the typicality measure.



8 Discussion and Conclusion 

In this paper, we have presented a novel way to categorize natural scenes based on a 
semantic typicality measure. We have shown that it is indispensable both from a human’s 
perspective and from a system’s point of view to model the local content and thus the 
diversity of scene categories. With our typicality measure ambiguous images can be 
marked as being less typical for a particular image category, or the transition between 
two categories can be determined. This behavior is of interest for image retrieval systems 
since humans are often interested in searching for images that are somewhere “between 
mountains and rivers/lakes, but have no flowers”. 

Considering the diversity of the images and scene categories used, classification rates 
of 89.3% with annotated concept regions and 67.2% using semantic concept classifiers 
are convincing. By also including the second best match in the categorization, an increase 
to 98.0% and 83.1%, respectively, could be reached. In particular this latter result reveals 
that many of the images misclassified with the first match are indeed at the borderline 
between two semantically related categories. This supports the claim that we are able 
to model the typicality and ambiguity of unseen images. The results also show that the 
categorization performance is strongly dependent on the performance of the individual 
concept classifiers which will be the topic of further research. 
Abstract. A tunable nearest neighbor (TNN) classifier is proposed to 
handle the discrimination problems. The TNN borrows the concept of 
feature line spaces from the nearest feature line (NFL) classifier, to make 
use of the information implied by the interaction between each pair of 
points in the same class. Instead of the NFL distance, a tunable distance 
metric is proposed in the TNN. The experimental evaluation shows that 
in the given feature space, the TNN consistently achieves better perfor- 
mance than NFL and conventional nearest neighbor methods, especially 
for the tasks with small training sets. 



1 Introduction 

We address a discrimination problem with N labeled training samples originating 
from C categories. Denote the training set as $X = \{\{x_i^c\}_{i=1}^{N_c}\}_{c=1}^{C}$, where 
$\{x_i^c\}_{i=1}^{N_c}$ represents the sample subset for the c-th class and $N_c$ is the subset's 
size, which satisfies $N = \sum_{c=1}^{C} N_c$. The task is to predict the class membership of 
an unlabeled sample x. 

The k-nearest-neighbor (k-NN) method [4] is a simple and efficient approach 
to this task. We find the k nearest neighbors of x in the training set and classify 
x as the majority class among the k nearest neighbors. In a given feature space, 
it is very important to select an appropriate distance metric for k-NN. 

There have been various distance metrics used in k-NN, which can be divided 
into two categories. The distance metrics in the first category are defined 
between an unlabeled point and a labeled point in the feature space, e.g. Euclidean 
distance, Hamming distance, Cosine distance, Kullback-Leibler (KL) distance [8], 
etc. Using these distance metrics, the training points are regarded as isolated 
points in the feature space. Hence, some useful information implied by the 
interaction of samples is ignored. In contrast, the distance 
metrics in the second category make use of some prior knowledge about the whole 
training set, such as the Mahalanobis distance and the quadratic distance. In particular, a 
discriminant adaptive nearest neighbor (DANN) classification method is proposed 
in [5], where a local linear discriminant analysis (LDA) is adopted to estimate 
an effective local quadratic distance metric to find neighbors. However, these 
distance metrics are effective only if the training set is large enough. 

In this paper, we consider the discrimination problem with multiple but 
very limited samples for each class, e.g. the face recognition task. In these problems, 
1-NN (also called NN for simplicity) is frequently adopted because of the 
small training set, and the mentioned distance metrics in the second category 
are inappropriate. In [1][2], a nearest feature line (NFL) method is proposed to 
make use of the information implied by each pair of points in the same class by 
constituting some feature line (FL) spaces. The NFL distance is defined as the 
Euclidean distance between an unlabeled point and its projection onto the FL. 
The experimental results have shown that the NFL can consistently produce superior 
results over the NN methods based on conventional distances [1][2]. However, the 
NFL distance brings some problems, which weaken the NFL's performance 
in some cases such as the example in Fig. 1. A tunable nearest neighbor (TNN) 
method is proposed to strengthen the original NFL by using a new distance 
metric which can be tuned through a parameter. The parameter selection process 
is robust using cross-validation. Our experimental results substantiate the 
efficiency of the TNN, especially in the cases when only a small training set is 
available and the data distribution in the feature space is nonlinear. 

2 Related Work 

A discriminant adaptive nearest neighbor (DANN) method is proposed in [5], 
where a local LDA metric $\Sigma_0$ for the test point $x_0$ is learned using its nearest 
neighbor points through an iterative process. At completion, the distance 
$d(x, x_0) = (x - x_0)^T \Sigma_0 (x - x_0)$ is used to obtain $x_0$'s k-nearest neighbors for 
classification. Obviously, some prior knowledge has been introduced to the DANN. The 
DANN classifier can be expected to achieve better performance than the conventional 
NN classifiers. However, a large sample set is needed for good estimations 
of the local quadratic metrics. 

The nearest feature line (NFL) method [1][2] constructs some feature line 
spaces to make use of the information implied by the interaction between each 
pair of points in the same class. A feature line (FL) is defined as a straight line 
$\overline{x_i^c x_j^c}$ passing through two points $x_i^c$ and $x_j^c$ which belong to the same class in 
the feature space (see Fig. 2). All FLs in the same class constitute an FL space 
of that class, $S_c = \{\overline{x_i^c x_j^c} \mid 1 \le i, j \le N_c,\ i \ne j\}$, and there are C FL spaces. 

In the NFL classifier, the distance between an unlabeled point and its 
projection onto the FL is calculated and used as the metric. Let $x_p^{c,ij}$ 
represent the projection of the test point x onto the FL $\overline{x_i^c x_j^c}$, as shown in Fig. 2. 
Then the NFL distance is described as $d_{NFL}(x, \overline{x_i^c x_j^c}) = \| x - x_p^{c,ij} \|$. 
According to the NN rule, x is classified into the class $c^o$, which satisfies 

$d_{NFL}(x, \overline{x_{i^o}^{c^o} x_{j^o}^{c^o}}) = \min_{1 \le c \le C} \ \min_{1 \le i,j \le N_c,\ i \ne j} d_{NFL}(x, \overline{x_i^c x_j^c})$. 
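The projection-based NFL distance can be illustrated with a few lines of code. The following is only a minimal sketch in plain Python; the function name `nfl_distance` and the toy points are assumptions made for illustration and do not come from the paper.

```python
import math

def nfl_distance(x, xi, xj):
    """Distance from x to the (infinite) line through xi and xj, xi != xj."""
    direction = [b - a for a, b in zip(xi, xj)]
    offset = [c - a for a, c in zip(xi, x)]
    # Position parameter t of the projection p = xi + t * (xj - xi).
    t = sum(o * d for o, d in zip(offset, direction)) / sum(d * d for d in direction)
    projection = [a + t * d for a, d in zip(xi, direction)]
    return math.dist(x, projection)

# A point above the segment and a point far outside it (extrapolated part).
print(nfl_distance((0.0, 1.0), (-1.0, 0.0), (1.0, 0.0)))  # 1.0
print(nfl_distance((5.0, 1.0), (-1.0, 0.0), (1.0, 0.0)))  # 1.0
```

Both query points receive the same distance although the second one lies far from either prototype, which illustrates the effect of extrapolating the line.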

Using the NFL distance $d_{NFL}$ in the FL spaces is equivalent to extending 
each pair of training points in the same class to an infinite number of points lying 
on the corresponding FL, using interpolation and extrapolation. This infinite 
extension of the original training set brings some problems [3]. That is, the 
extended part of one class may cross other classes, especially 
in the nonlinear cases such as the example illustrated in Fig. 1. In practice, 
this problem can be partly handled by using only close neighbors in the same 
class to create FL spaces. For the example illustrated in Fig. 1, if only the FLs 
constituted by each sample and its nearest neighbor in the same class are used, 
the FL space of class 2 will not cross that of class 1. However, the FL space of 
class 1 will still cross that of class 2. In addition, if the training set is too small, 
the usable FLs will be very limited. Hence, in the next section, we will handle this 
problem by designing a new distance metric in the original FL spaces. 

Fig. 1. The points come from two categories: class 1 denoted by circles and class 2 
denoted by asterisks. (a) Five training points are randomly selected from each class, 
denoted by solid triangles and squares respectively. (b) The feature line spaces. As we 
can see, the extended parts of class 1 and class 2 are interwoven. 

Fig. 2. The contours for an FL $\overline{x_i^c x_j^c}$ with different distance metrics. Two solid-line circles 
are for Euclidean distance. Two parallel dash-dotted lines are for the NFL distance. 



3 Tunable Nearest Neighbor Method (TNN) 

Similar to the NFL, the NN based on Euclidean distance can also be reformulated 
in the FL space by setting the distance metric as $d_{NN}(x, \overline{x_i^c x_j^c}) = 
\min\{d(x, x_i^c), d(x, x_j^c)\}$. However, it does not make use of the virtue of the FL 
spaces. The reason is that while calculating the distance $d_{NN}(x, \overline{x_i^c x_j^c})$, the pair 
of points, $x_i^c$ and $x_j^c$, are treated as isolated ones. 

Let us discuss the effects of various distance metrics in the FL spaces using the 
concept of an equal-distance surface (also called a contour in 2-dimensional cases). 
An equal-distance surface for an FL is defined as a surface in the feature space 
on which the points have the same distance to the FL. For a 2-dimensional 
case illustrated in Fig. 2, the contour for an FL with Euclidean distance consists 
of two circles or a closed curve formed by the intersection of two circles, and 
the contour for an FL with the NFL distance is two parallel lines. Obviously, 
the equal-distance surfaces for an FL, which reflect the form of the interaction 
between the corresponding pair of points, should be adapted to the given data 
set. Hence, though the NFL achieves appealing results in many cases, it may 
perform poorly in some cases such as the example illustrated in Fig. 1. 

Here, we propose a new distance metric which can be tuned through a 
parameter. The new distance from x to the FL $\overline{x_i^c x_j^c}$ is calculated as follows: 

— Calculate two ratios: $r_1 = \|x - x_i^c\| / \|x_j^c - x_i^c\|$ and $r_2 = \|x - x_j^c\| / \|x_j^c - x_i^c\|$; 

— Set $\mu_m = (r_m)^{\alpha}$ ($m = 1, 2$ and $\alpha \ge 0$). Get two points $x_m^{c,ij}$ ($m = 1, 2$) on 
the FL $\overline{x_i^c x_j^c}$ as $x_1^{c,ij} = x_i^c + \mu_1 (x_j^c - x_i^c)$ and $x_2^{c,ij} = x_j^c + \mu_2 (x_i^c - x_j^c)$; 

— Let $d_{TNN}(x, \overline{x_i^c x_j^c}) = \min\{d(x, x_1^{c,ij}), d(x, x_2^{c,ij})\}$. 
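The three steps above can be sketched directly in code. This is a minimal illustration in plain Python under the reading of the construction given in the text; the function name `tnn_distance` and the example points are assumptions, not part of the paper.

```python
import math

def tnn_distance(x, xi, xj, alpha=2.0):
    """Tunable distance from x to the feature line through xi and xj."""
    length = math.dist(xi, xj)
    # Step 1: ratios of the distances to the two prototypes over the pair length.
    r1 = math.dist(x, xi) / length
    r2 = math.dist(x, xj) / length
    # Step 2: positions on the feature line, controlled by alpha.
    mu1, mu2 = r1 ** alpha, r2 ** alpha
    p1 = [a + mu1 * (b - a) for a, b in zip(xi, xj)]
    p2 = [b + mu2 * (a - b) for a, b in zip(xi, xj)]
    # Step 3: the smaller of the two Euclidean distances.
    return min(math.dist(x, p1), math.dist(x, p2))

x, xi, xj = (3.0, 4.0), (-1.0, 0.0), (1.0, 0.0)
for alpha in (0.0, 1.0, 2.0):
    print(alpha, tnn_distance(x, xi, xj, alpha))
```

With `alpha = 0.0` both positions collapse onto the opposite prototype, so the value equals the smaller of the two Euclidean distances to `xi` and `xj`.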

Tuning $\alpha$ is equivalent to adjusting the equal-distance surface forms for the 
FLs. The contours for an FL using the TNN distance $d_{TNN}$ with different 
values for $\alpha$ are illustrated in Fig. 3. If $\alpha$ equals zero, $d_{TNN}$ is equivalent to 
$d_{NN}$. When $\alpha$ is near unity, the equal-distance surfaces for the FL using $d_{TNN}$ 
are similar to those using $d_{NFL}$. As $\alpha$ gets larger, the equal-distance surfaces 
become fatter, which indicates that the interaction between each pair is 
gradually eased. And when $\alpha$ is large enough, $d_{TNN}$ will approximate $d_{NN}$. 
The change of the equal-distance surface form is continuous. 

Specifically, when $\alpha = 2$, the TNN distance turns into 

$d_{TNN}(x, \overline{x_i^c x_j^c}) = \| x - x_i^c \| \cdot \| x - x_j^c \| \,/\, \| x_i^c - x_j^c \|$    (1) 

Using $d_{TNN}$ with $\alpha = 2$, the interaction between each pair of points is moderate. 
In many cases, 2 is a recommendable value for $\alpha$ in $d_{TNN}$, though it may not 
be the optimum value. 
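The closed form (1) can be checked from the construction of the two points on the FL. The short derivation below is a sketch that uses the shorthand $a = \|x - x_i^c\|$, $b = \|x - x_j^c\|$ and $L = \|x_j^c - x_i^c\|$, which is introduced here for convenience and is not notation from the paper.

```latex
\begin{align*}
\mu_1 &= a^2 / L^2, \qquad x_1^{c,ij} = x_i^c + \mu_1 (x_j^c - x_i^c), \\
2\,\langle x - x_i^c,\; x_j^c - x_i^c \rangle &= a^2 + L^2 - b^2
  \quad\text{(expand } b^2 = \|(x - x_i^c) - (x_j^c - x_i^c)\|^2\text{)}, \\
\| x - x_1^{c,ij} \|^2
  &= a^2 - \mu_1 (a^2 + L^2 - b^2) + \mu_1^2 L^2 \\
  &= a^2 - \frac{a^4}{L^2} - a^2 + \frac{a^2 b^2}{L^2} + \frac{a^4}{L^2}
   = \frac{a^2 b^2}{L^2}.
\end{align*}
```

Exchanging the roles of the two prototypes gives the same value for the second point, so the minimum in the definition equals $ab/L$.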

According to the NN rule, x is classified into the class $c^o$, which satisfies 

$d_{TNN}(x, \overline{x_{i^o}^{c^o} x_{j^o}^{c^o}}) = \min_{1 \le c \le C} \ \min_{1 \le i,j \le N_c,\ i \ne j} d_{TNN}(x, \overline{x_i^c x_j^c})$. 

We call the NN classifier based on $d_{TNN}$ the tunable nearest neighbor (TNN) classifier. 
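A compact sketch of the resulting decision rule follows. It is written in plain Python for illustration only; the names `tnn_distance` and `tnn_classify`, the dictionary layout of the training set and the toy data are all assumptions. Because the distance is symmetric in the two prototypes, unordered pairs are enumerated.

```python
import math
from itertools import combinations

def tnn_distance(x, xi, xj, alpha=2.0):
    """Tunable distance from x to the feature line through xi and xj."""
    length = math.dist(xi, xj)
    mu1 = (math.dist(x, xi) / length) ** alpha
    mu2 = (math.dist(x, xj) / length) ** alpha
    p1 = [a + mu1 * (b - a) for a, b in zip(xi, xj)]
    p2 = [b + mu2 * (a - b) for a, b in zip(xi, xj)]
    return min(math.dist(x, p1), math.dist(x, p2))

def tnn_classify(x, training, alpha=2.0):
    """training maps a class label to a list of at least two points."""
    best_label, best_dist = None, math.inf
    for label, points in training.items():
        # Every pair of same-class points defines one feature line.
        for xi, xj in combinations(points, 2):
            d = tnn_distance(x, xi, xj, alpha)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label

training = {
    "a": [(0.0, 0.0), (1.0, 0.0)],
    "b": [(0.0, 3.0), (1.0, 3.0)],
}
print(tnn_classify((0.5, 1.0), training))  # a
print(tnn_classify((0.5, 2.5), training))  # b
```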

The classification results for the example in Fig. 1 with 3 distance metrics 
are shown in Fig. 4, which show that the TNN has better adaptability to this 
specific data set than the NFL and the NN based on Euclidean distance. 



4 Experimental Evaluation 

To substantiate the efficiency of the TNN, we apply it to real data classification 
tasks. Here, we evaluate TNN's performance over some UCI datasets and the 
AR face database, versus NFL, NN and k-NN based on Euclidean distance. 

Fig. 3. The contours for an FL using $d_{TNN}$ with $\alpha$ equal to (a) 1.0, (b) 1.5, (c) 
2.0, and (d) 10.0. Two points denoted by asterisks are the pair used to construct the 
FL. 





Fig. 4. The classification results for the example illustrated in Fig. 1 with five training 
samples per class, using the NN classifiers based on (a) Euclidean distance, (b) the NFL 
distance, and (c) the TNN distance with $\alpha = 2$. 

4.1 UCI Datasets 

We select some datasets from the UCI data repository which satisfy the following 
requirements: (1) There are no missing features in the data; (2) The sample 
number for each class is not large. Many results on each dataset have 
been reported to evaluate the performance of various algorithms. However, the 
experimental settings are not always the same. Hence, we do not compare our 
results with those reported elsewhere. In our experiment, each dataset is 
randomly divided into 5 disjoint subsets of equal size. Each time, we select 
three subsets to constitute the training set and treat the rest as the testing set. 

Fig. 5. Experimental results over five UCI datasets. (a) Recognition rate of TNN over 
(1) Wine, (2) Ionosphere, (3) Spectf, (4) Sonar and (5) Diabetes, versus NFL, NN and 
k-NN (k = 3, 7) based on Euclidean distance. (b) The TNN's average recognition rate 
curves against the parameter $\alpha$. 

There are 10 different trials in total over each dataset. Using the results of these 
trials, we can calculate the mean and standard deviation of the recognition rates. 
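One reading consistent with the count of 10 trials is that every choice of three out of the five subsets serves once as the training set. The sketch below enumerates these splits; the function name `three_of_five_splits` is hypothetical and the interpretation itself is an assumption, since the text does not spell out how the trials are formed.

```python
from itertools import combinations

def three_of_five_splits(n_folds=5, n_train=3):
    """Return (training fold indices, testing fold indices) for every split."""
    folds = range(n_folds)
    return [
        (train, tuple(f for f in folds if f not in train))
        for train in combinations(folds, n_train)
    ]

splits = three_of_five_splits()
print(len(splits))  # 10
print(splits[0])    # ((0, 1, 2), (3, 4))
```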

Over nearly all selected datasets, TNN performs better than NFL, NN and 
k-NN (k = 3, 7) based on Euclidean distance. For limited space, we only list 
the results on the following five datasets in Fig. 5: ‘Wine’, ‘Ionosphere’, ‘Spectf’, 
‘Sonar’ and ‘Diabetes’. The optimum parameters $\alpha^o$ for TNN obtained through 
cross-validation are near 2.0 over all these five datasets. The average recognition 
rate curves for TNN against the parameter $\alpha$ over these datasets are shown in 
Fig. 5(b). From these curves, we can observe 3 remarkable facts: 1) The recognition 
rate of TNN with $\alpha = 1$ is comparable to that of NFL; 2) As $\alpha$ becomes 
large enough, the recognition rate of TNN becomes stable; 3) The recognition rate 
curve against the parameter $\alpha$ varies smoothly around the optimum value. 

4.2 Face Recognition 

The face recognition task is carried out over the AR face database. In the experiments, 
50 persons are randomly selected from the total 126 persons and 7 frontal view 
faces with no occlusions are selected from the first session for each person. We 
have manually carried out the localization step, followed by a morphing step so 
that each face occupies a fixed 27 x 16 array of pixels. The images are converted 
to gray-level images by averaging all three color channels, i.e., I = (R + G + B)/3. 
Principal component analysis (PCA) [7] is adopted here for dimensionality 
reduction. Hence, the data set is transformed into a d-dimensional PCA space. 
Different from [7], the PCA features are normalized using the corresponding 
eigenvalues here. (Note that in the un-normalized PCA feature space, TNN can 
also achieve a similar increase in recognition rate compared with NN and NFL. 
However, the normalized PCA features are more suitable for this face database.) 
Three samples per subject are randomly selected as the training samples and 
the rest as the testing ones. The procedure is repeated 10 times. 

Fig. 6. Experimental results over AR face data with various PC numbers d 
(horizontal axis: principal component number). (a) Recognition rate of TNN, versus 
NFL and NN based on Euclidean distance. (b) The TNN's average recognition rate 
curves against the parameter $\alpha$ with d = 20, 40, 60. 



We evaluate the performance of TNN, NFL and NN based on Euclidean 
distance with various principal component (PC) numbers. The optimum values 
for $\alpha$ are all close to 2.0. As shown in Fig. 6(a), in this experiment, TNN is 
comparable to NFL, and both of them are superior to NN based on Euclidean 
distance, with an increase of nearly 8 percent in recognition rate. The TNN's 
average recognition rate curves against the parameter $\alpha$ are shown in Fig. 6(b), 
with d = 20, 40, 60. From these curves, we can observe facts similar to those 
noted in the first experiment. 



4.3 Discussions on the Parameter Selection and Computational 
Load 

Besides the two experiments in this paper, the discrimination results over many 
other data sets, which are not presented here for limited space, exhibit some 
common facts. These facts accord well with the characteristics of the 
equal-distance surfaces for an FL using $d_{TNN}$ with different $\alpha$, e.g. 

— The recognition rate of TNN with $\alpha = 1$ is comparable to that of NFL. The 
underlying reason is that when $\alpha$ is near unity, the equal-distance surfaces for an 
FL using $d_{TNN}$ are similar to those using $d_{NFL}$. 

— The recognition rate curve against the parameter $\alpha$ is smooth. This may be 
because of the continuous change of the equal-distance surface form against 
$\alpha$, as illustrated in Fig. 3. 

— TNN with $\alpha = 2$ consistently achieves better performance than NN based on 
Euclidean distance. Hence, 2 is a recommendable value for $\alpha$. As noted in 
Section 3, the interaction between each pair of points in the 
same class is moderate when using $d_{TNN}$ with $\alpha = 2$. 

— As $\alpha$ becomes large enough ($\alpha > 10$), the recognition rate of the TNN becomes 
stable, which also accords with the trend of the equal-distance surfaces' 
evolution against $\alpha$. 

Through these facts and analysis, we can expect that TNN will perform 
better than NFL and NN based on Euclidean distance. And because of TNN's 
gradual change of recognition rates against $\alpha$, we can adopt a large step size to 
search for the optimum value for $\alpha$ in a limited interval (generally set as [0, 5]). 

Similar to NFL, TNN has to calculate the TNN distance M times to 
classify a sample, where $M = \sum_{c=1}^{C} N_c (N_c - 1)/2$. Hence, TNN's computational 
load is obviously heavier than that of NN. However, in tasks with a small 
training set, the increase of the computational load is tolerable. On the other 
hand, cross-validation is used to determine the optimal value for $\alpha$, which also 
adds complexity to TNN in the training process. This problem exists in nearly 
all the methods which need to select optimal parameters for the algorithms. 
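The count M of feature lines grows quadratically with the class sizes. The small sketch below evaluates it for the face recognition setting described above (50 subjects with three training samples each); the function name `num_feature_lines` is a hypothetical helper.

```python
def num_feature_lines(class_sizes):
    """Number of same-class pairs, i.e. the sum of N_c * (N_c - 1) / 2."""
    return sum(n * (n - 1) // 2 for n in class_sizes)

print(num_feature_lines([3] * 50))   # 150
print(num_feature_lines([30] * 50))  # 21750
```

With three samples per class the number of feature lines equals the number of training points, whereas with 30 samples per class it is already more than fourteen times larger.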

5 Conclusions 

A tunable nearest neighbor (TNN) method is proposed to make use of the infor- 
mation implied by the interaction between each pair of points in the same class. 
The TNN borrows the concept of feature line (FL) from the nearest feature line 
(NFL) method. Instead of the NFL distance in the NFL, a tunable distance 
metric is defined in the TNN. Hence, the effect caused by the interaction be- 
tween each pair of points can be adjusted to adapt well to the given data set. 
The parameter selection process is predictable and robust using cross-validation. 
Moreover, there is a recommendable value, 2, for the parameter. The experimen- 
tal results show that in the given feature space, the TNN consistently achieves 
better performance than the NFL and the NN based on Euclidean distance. 
Abstract. We propose various novel embedded approaches for (simultaneous) 
feature selection and classification within a general optimisation 
framework. In particular, we include linear and nonlinear SVMs. We ap- 
ply difference of convex functions programming to solve our problems 
and present results for artificial and real-world data. 



1 Introduction 

Overview and related work. Given a pattern recognition problem as a train- 
ing set of labelled feature vectors, our goal is to find a mapping that classifies 
the data correctly. In this context, feature selection aims at picking out some of 
the original input dimensions ( features ) (i) for performance issues by facilitating 
data collection and reducing storage space and classification time, (ii) to per- 
form semantics analysis helping to understand the problem, and (iii) to improve 
prediction accuracy by avoiding the “curse of dimensionality” (cf. [6]). 

Feature selection approaches divide into filters that act as a preprocessing 
step independently of the classifier, wrappers that take the classifier into account 
as a black box, and embedded approaches that simultaneously determine features 
and classifier during the training process (cf. [6]). In this paper, we deal with the 
latter method and focus on direct objective minimisation. Our linear classification 
framework is based on [4], but takes into account that the Support Vector 
Machine (SVM) provides good generalisation ability by its $\ell_2$-regulariser. There 
exist only few papers on nonlinear classification with embedded feature selection. 
An approach for the quadratic 1-norm SVM was suggested in [12]. An example 
for a wrapper method employing a Gaussian kernel SVM error bound is [11]. 
Contribution. We propose a range of new embedded methods for feature se- 
lection regularising linear embedded approaches and construct feature selection 
methods for nonlinear SVMs. To solve the non-convex problems, we apply the 
general difference of convex functions (d.c.) optimisation algorithm. 

Structure. In the next section, we present various extensions of the linear em- 
bedded approach proposed in [4] and consider feature selection methods in con- 
junction with nonlinear classification. The d.c. optimisation approach and its 
application to our problems is described in Sect. 3. Numerical results illustrat- 
ing and evaluating various approaches are given in Sect. 4. 



C.E. Rasmussen et al. (Eds.): DAGM 2004, LNCS 3175, pp. 212-219, 2004. 

2 Feature Selection by Direct Objective Minimisation 

Given a training set $\{(x_i, y_i) \in X \times \{-1, 1\} : i = 1, \dots, n\}$ with $X \subset \mathbb{R}^d$, our 
goal is both to find a classifier $F : X \to \{-1, 1\}$ and to select features. 

2.1 Linear Classification 

The linear classification approaches construct two parallel bounding planes in $\mathbb{R}^d$ 
such that the differently labelled sets are to some extent in the two opposite half 
spaces determined by these planes. More precisely, one solves the minimisation 
problem 

$\min_{w \in \mathbb{R}^d,\ b \in \mathbb{R}} \ (1 - \lambda) \sum_{i=1}^{n} (1 - y_i (w^T x_i + b))_+ + \lambda \rho(w)$    (1) 

with $\lambda \in [0, 1)$, regulariser $\rho$ and $x_+ := \max(x, 0)$. Then the classifier is $F(x) = 
\mathrm{sgn}(w^T x + b)$. For $\rho = 0$, the linear method (1) was proposed as Robust Linear 
Programming (RLP) by Bennett and Mangasarian [2]. Note that these authors 
weighted the training errors by $1/n_{\pm 1}$, where $n_{\pm 1} = |\{i : y_i = \pm 1\}|$. 
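To make the objective (1) concrete, the sketch below evaluates it for a given hyperplane with a pluggable regulariser. It is only an illustration in plain Python: the function name `hinge_objective`, the argument layout and the toy data are assumptions, and the class-wise error weighting used by Bennett and Mangasarian is not included.

```python
def hinge_objective(w, b, X, y, lam, rho):
    """(1 - lam) * sum of hinge losses + lam * rho(w), as in objective (1)."""
    loss = 0.0
    for xi, yi in zip(X, y):
        score = sum(wk * xk for wk, xk in zip(w, xi)) + b
        loss += max(0.0, 1.0 - yi * score)  # (1 - y_i (w^T x_i + b))_+
    return (1.0 - lam) * loss + lam * rho(w)

def l1_norm(w):
    return sum(abs(wk) for wk in w)

X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, -1]
# Only the third point violates the margin; it contributes a loss of 1.5.
print(hinge_objective((1.0, 0.0), 0.0, X, y, lam=0.1, rho=l1_norm))
```

Passing `rho=lambda w: 0.0` corresponds to the unregularised case.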

In order to maximise the margin between the two parallel planes, the original 
SVM penalises the $\ell_2$-norm $\rho(w) = \frac{1}{2}\|w\|_2^2$. Then (1) can be solved by a convex 
Quadratic Program (QP). 

In order to suppress features, $\ell_p$-norms with $p < 2$ are used. In [4], the 
$\ell_1$-norm (lasso penalty) $\rho(w) = \|w\|_1$ leads to good feature selection and 
classification results. Moreover, for the $\ell_1$-norm, (1) can be solved by a linear program. 

The feature selection can be further improved by using the so-called 
$\ell_0$-“norm” $\|w\|_0 = |\{i : w_i \ne 0\}|$ [4,10]. Since the $\ell_0$-norm is non-smooth, it was 
approximated in [4] by the concave functional 

$\rho(w) = e^T (e - e^{-\alpha |w|}) \approx \|w\|_0$    (2) 

with approximation parameter $\alpha \in \mathbb{R}_+$ and $e = (1, \dots, 1)^T$. Problem (1) with 
penalty (2) is known as Feature Selection concaVe (FSV). Now the solution of 
(1) becomes more sophisticated and can be obtained, e.g., by the Successive 
Linearization Algorithm (SLA) as proposed in [4]. 
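The quality of the exponential approximation (2) is easy to inspect numerically. The following plain-Python sketch compares it with the exact count of nonzero entries; the function names and the sample vector are assumptions chosen for illustration.

```python
import math

def l0_norm(w):
    """Number of nonzero entries of w."""
    return sum(1 for wi in w if wi != 0.0)

def l0_approx(w, alpha=5.0):
    """Concave approximation: sum over i of 1 - exp(-alpha * |w_i|)."""
    return sum(1.0 - math.exp(-alpha * abs(wi)) for wi in w)

w = [0.0, 2.0, -3.0, 0.0, 0.05]
print(l0_norm(w))
for alpha in (1.0, 5.0, 50.0):
    print(alpha, l0_approx(w, alpha))
```

Small nonzero entries such as 0.05 are counted only partially unless the approximation parameter is large, which is the price paid for a smooth penalty.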

New feature selection approaches. Since the $\ell_2$ penalty term leads to very 
good classification results while the $\ell_1$ and $\ell_0$ penalty terms focus on feature 
selection, we suggest using combinations of these terms. As is common, to eliminate 
the absolute values in the $\ell_1$-norm or in the approximate $\ell_0$-norm, we introduce 
additional variables $v_i \ge |w_i|$ ($i = 1, \dots, d$) and consider $\nu\rho(v) + \chi_{[-v,v]}(w)$ 
instead of $\lambda\rho(w)$, where $\chi_C$ denotes the indicator function $\chi_C(x) = 0$ if $x \in C$ 
and $\chi_C(x) = \infty$ otherwise (cf. [7,8]). As a result, for $\mu, \nu \in \mathbb{R}_+$, we minimise 

$f(w, b, v) := \sum_{i=1}^{n} (1 - y_i (w^T x_i + b))_+ + \frac{\mu}{2} \|w\|_2^2 + \nu\rho(v) + \chi_{[-v,v]}(w)$ .    (3) 

In case of the $\ell_1$-norm, problem (3) can be solved by a convex QP. For the 
approximate $\ell_0$-norm an appropriate method is presented in Sect. 3. 

2.2 Nonlinear Classification 

For problems which are not linearly separable, a so-called feature map $\phi$ which 
usually maps the set $X \subset \mathbb{R}^d$ onto a higher dimensional space $\phi(X) \subset \mathbb{R}^{d'}$ 
($d' \ge d$) is used. Then the linear approach (1) is applied in the new feature 
space $\phi(X)$. This results in a nonlinear classification in the original space $\mathbb{R}^d$, 
i.e., in nonlinear separating surfaces. 

Quadratic feature map. We start with the simple quadratic feature map 

$\phi : X \to \mathbb{R}^{d'}$ , $x \mapsto (x^{\alpha} : \alpha \in \mathbb{N}_0^d,\ 0 < \|\alpha\|_1 \le 2)$ , 

where $d' = d(d+3)/2$, and apply (1) in $\mathbb{R}^{d'}$ with the approximate $\ell_0$-penalty (2): 

$f(w, b, v) := (1 - \lambda) \sum_{i=1}^{n} (1 - y_i (w^T \phi(x_i) + b))_+ + \lambda e^T (e - e^{-\alpha v}) 
+ \sum_{i=1}^{d'} \sum_{j=1,\ \phi_i(e_j) \ne 0}^{d} \chi_{[-v_j, v_j]}(w_i) \ \to \ \min_{w \in \mathbb{R}^{d'},\ b \in \mathbb{R},\ v \in \mathbb{R}^d}$    (4) 

where $e_j \in \mathbb{R}^d$ denotes the j-th unit vector. We want to select features in the 
original space $\mathbb{R}^d$ due to (i)-(ii) in Sect. 1. Thus we include the appropriate 
indicator functions. A similar approach in [12] does not involve this idea and 
achieves only a feature selection in the transformed feature space $\mathbb{R}^{d'}$. We will 
refer to (4) as quadratic FSV. In principle, the approach can be extended to 
other feature maps $\phi$, especially to other polynomial degrees. 
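The quadratic feature map lists all monomials of degree one and two, which gives $d(d+3)/2$ coordinates. A minimal plain-Python sketch follows; the function name `quadratic_feature_map` and the ordering of the monomials (linear terms first, then products) are assumptions made for illustration.

```python
from itertools import combinations_with_replacement

def quadratic_feature_map(x):
    """All monomials of degree 1 and 2 of the entries of x."""
    d = len(x)
    linear = list(x)
    quadratic = [
        x[i] * x[j] for i, j in combinations_with_replacement(range(d), 2)
    ]
    return linear + quadratic

print(quadratic_feature_map([2.0, 3.0]))  # [2.0, 3.0, 4.0, 6.0, 9.0]
for d in (2, 4, 8):
    # Length of the mapped vector versus the closed form d (d + 3) / 2.
    print(d, len(quadratic_feature_map([1.0] * d)), d * (d + 3) // 2)
```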

Gaussian kernel feature map. Next we consider SVMs with the feature map 
related to the Gaussian kernel 

$K(x, z) = K_\theta(x, z) = e^{-\|x - z\|_{2,\theta}^2 / 2\sigma^2}$    (5) 

with weighted $\ell_2$-norm $\|x\|_{2,\theta}^2 = \sum_{k=1}^{d} \theta_k |x_k|^2$ by $K(x, z) = \langle \phi(x), \phi(z) \rangle$ for 
all $x, z \in X$. We apply the usual SVM classifier. For further information on 
nonlinear SVMs see, e.g., [9]. Direct feature selection, i.e., the setting of as many 
$\theta_k$ to zero as possible while retaining or improving the classification ability, is 
a difficult problem. One possible approach is to use a wrapper as in [11]. In 
[5], the alignment $\hat{A}(K, yy^T) = y^T K y / (n \|K\|_F)$ was proposed as a measure of 
conformance of a kernel with a learning task. Therefore, we suggest maximising it 
in a modified form $y_n^T K y_n$ where $y_n = (y_i / n_{y_i})_{i=1}^{n}$. Then, with penalty (2), we 
define our kernel-target alignment approach for feature selection as 

$f(\theta) := -(1 - \lambda) \frac{1}{2} y_n^T K_\theta y_n + \lambda \frac{1}{d} e^T (e - e^{-\alpha\theta}) + \chi_{[0,e]}(\theta) \ \to \ \min_{\theta \in \mathbb{R}^d}$ .    (6) 

The scaling factors $\frac{1}{2}$, $\frac{1}{d}$ ensure that both objective terms take values in [0, 1]. 
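The behaviour of the alignment objective (6) can be explored on a toy problem in which only the first feature carries class information. The sketch below is plain Python written under the reading of (5) and (6) given above; the function name `alignment_objective`, the default parameter values and the data are assumptions for illustration, not the authors' implementation.

```python
import math

def alignment_objective(theta, X, y, lam=0.1, alpha=5.0, sigma=1.0):
    """Penalised negative kernel-target alignment for feature weights theta."""
    if any(t < 0.0 or t > 1.0 for t in theta):
        return math.inf  # indicator function of the box [0, e]
    n_class = {label: sum(1 for yi in y if yi == label) for label in (-1, 1)}
    yn = [yi / n_class[yi] for yi in y]  # labels scaled by class size
    align = 0.0
    for xi, yni in zip(X, yn):
        for xj, ynj in zip(X, yn):
            sq = sum(t * (a - b) ** 2 for t, a, b in zip(theta, xi, xj))
            align += yni * ynj * math.exp(-sq / (2.0 * sigma ** 2))
    penalty = sum(1.0 - math.exp(-alpha * t) for t in theta) / len(theta)
    return -(1.0 - lam) * 0.5 * align + lam * penalty

X = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
y = [-1, -1, 1, 1]
print(alignment_objective((1.0, 0.0), X, y))  # relevant feature only
print(alignment_objective((0.0, 1.0), X, y))  # irrelevant feature only
```

Selecting the informative feature yields a clearly lower objective value than selecting the uninformative one, which is the effect the penalised alignment is meant to exploit.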
3 D.C. Programming and Optimisation 

A robust algorithm for minimising non-convex problems is the Difference of 
Convex functions Algorithm (DCA) proposed in [7]. Its goal is to minimise a 
function $f : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ which reads 

$f(x) = g(x) - h(x) \ \to \ \min_{x \in \mathbb{R}^d}$ ,    (7) 

where $g, h : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ are lower semi-continuous, proper convex functions, 
cf. [8]. In the next subsections, we first introduce the DCA and then apply it to 
our non-convex feature selection problems. 

3.1 D.C. Programming 

For g as assumed above, we introduce the domain of g, its conjugate function 
at $\tilde{x} \in \mathbb{R}^d$ and its subdifferential at $z \in \mathbb{R}^d$ by $\mathrm{dom}\, g := \{x \in \mathbb{R}^d : g(x) < \infty\}$, 
$g^*(\tilde{x}) := \sup_{x \in \mathbb{R}^d} \{\langle x, \tilde{x} \rangle - g(x)\}$ and 
$\partial g(z) := \{\tilde{x} \in \mathbb{R}^d : g(x) \ge g(z) + \langle x - z, \tilde{x} \rangle \ \forall x \in \mathbb{R}^d\}$, respectively. 
For differentiable functions we have that $\partial g(z) = \{\nabla g(z)\}$. 
According to [8, Theorem 23.5], it holds 

$\partial g(x) = \arg\max_{\tilde{x} \in \mathbb{R}^d} \{x^T \tilde{x} - g^*(\tilde{x})\}$ ,   $\partial g^*(\tilde{x}) = \arg\max_{x \in \mathbb{R}^d} \{\tilde{x}^T x - g(x)\}$ .    (8) 

Further assume that $\mathrm{dom}\, g \subset \mathrm{dom}\, h$ and $\mathrm{dom}\, h^* \subset \mathrm{dom}\, g^*$. It was proved in [7] 
that then every limit point of the sequence $(x^k)_{k \in \mathbb{N}_0}$ produced by the following 
algorithm is a critical point of f in (7): 

Algorithm 3.1: D.C. minimisation Algorithm (DCA)(g, h, tol) 

choose $x^0 \in \mathrm{dom}\, g$ arbitrarily 
for $k \in \mathbb{N}_0$ 
    select $\tilde{x}^k \in \partial h(x^k)$ arbitrarily 
    select $x^{k+1} \in \partial g^*(\tilde{x}^k)$ arbitrarily 
    if $\min\left( |x_i^{k+1} - x_i^k| ,\ \left| \frac{x_i^{k+1} - x_i^k}{x_i^k} \right| \right) \le$ tol $\ \forall i = 1, \dots, d$ 
        then return $(x^{k+1})$ 

We can show (but omit the proof for lack of space) that the DCA applied 
to a particular d.c. decomposition (7) of FSV coincides with the SLA. 
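The two alternating steps of the DCA can be sketched on a small toy problem that is not taken from the paper: minimise $\frac{1}{2}\|v - a\|^2 + \nu \sum_i (1 - e^{-\alpha v_i})$ over $v \ge 0$. With $g(v) = \frac{1}{2}\|v - a\|^2 + \chi_{v \ge 0}(v)$ and $h(v) = -\nu \sum_i (1 - e^{-\alpha v_i})$, both steps have closed forms. The function name `dca`, its arguments and all numbers below are assumptions for illustration.

```python
import math

def dca(grad_h, argmax_conjugate_step, x0, tol=1e-5, max_iter=1000):
    """Generic DCA loop for f = g - h with differentiable h."""
    x = list(x0)
    for _ in range(max_iter):
        x_tilde = grad_h(x)                     # step 1: element of dh(x^k)
        x_new = argmax_conjugate_step(x_tilde)  # step 2: element of dg*(x~^k)
        # Stop when the absolute or the relative change is small in every entry.
        converged = all(
            abs(n - o) <= tol or (o != 0.0 and abs((n - o) / o) <= tol)
            for n, o in zip(x_new, x)
        )
        x = x_new
        if converged:
            break
    return x

a, nu, alpha = [2.0, 0.1], 1.0, 5.0
# Gradient of h(v) = -nu * sum(1 - exp(-alpha * v_i)).
grad_h = lambda v: [-nu * alpha * math.exp(-alpha * vi) for vi in v]
# argmin over v >= 0 of 0.5 * ||v - a||^2 - v_tilde^T v, solved entrywise.
step = lambda vt: [max(ai + ti, 0.0) for ai, ti in zip(a, vt)]

print(dca(grad_h, step, x0=[1.0, 1.0]))
```

In this toy run the small second entry is driven exactly to zero while the large first entry is almost unchanged, which mirrors the feature-suppressing effect of the concave penalty. The result is a critical point and need not be a global minimiser.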

3.2 Application to Our Feature Selection Problems 

The crucial point in applying the DCA is to define a suitable d.c. decomposition 
(7) of the objective function. The aim of this section is to propose such decom- 
positions for our different approaches. 

$\ell_2$-$\ell_0$-SVM. A viable d.c. decomposition for (3) with (2) reads 

$g(w, b, v) = \sum_{i=1}^{n} (1 - y_i (w^T x_i + b))_+ + \frac{\mu}{2} \|w\|_2^2 + \chi_{[-v,v]}(w)$ , 

$h(v) = -\nu e^T (e - e^{-\alpha v})$ , 

which gives rise to a convex QP in each DCA step. 

Quadratic FSV. To solve (4) we use the d.c. decomposition 

$g(w, b, v) = (1 - \lambda) \sum_{i=1}^{n} (1 - y_i (w^T \phi(x_i) + b))_+ + \sum_{i=1}^{d'} \sum_{j=1,\ \phi_i(e_j) \ne 0}^{d} \chi_{[-v_j, v_j]}(w_i)$ , 

$h(v) = -\lambda e^T (e - e^{-\alpha v})$ , 

which leads to a linear problem in each DCA step. 

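Kernel-target alignment approach. For the function defined in (6), as the 
kernel (5) is convex in $\theta$, we split f as 

$g(\theta) = \frac{1 - \lambda}{2 n_{+1} n_{-1}} \sum_{i,j=1,\ y_i \ne y_j}^{n} e^{-\|x_i - x_j\|_{2,\theta}^2 / 2\sigma^2} + \chi_{[0,e]}(\theta)$ , 

$h(\theta) = \frac{1 - \lambda}{2} \sum_{i,j=1,\ y_i = y_j}^{n} \frac{1}{n_{y_i}^2} e^{-\|x_i - x_j\|_{2,\theta}^2 / 2\sigma^2} - \frac{\lambda}{d} e^T (e - e^{-\alpha\theta})$ . 

Now h is differentiable, so applying the DCA we find the solution in the first 
step of iteration k as $\tilde{\theta}^k = \nabla h(\theta^k)$. In the second step, we are looking for 
$\theta^{k+1} \in \partial g^*(\tilde{\theta}^k) = \arg\max_{\theta} \{\theta^T \tilde{\theta}^k - g(\theta)\}$, which leads to solving the convex 
non-quadratic problem 

$\min_{\theta \in \mathbb{R}^d} \ \frac{1 - \lambda}{2 n_{+1} n_{-1}} \sum_{i,j=1,\ y_i \ne y_j}^{n} e^{-\|x_i - x_j\|_{2,\theta}^2 / 2\sigma^2} - \theta^T \tilde{\theta}^k$   subject to $0 \le \theta \le e$ 

with a valid initial point $0 \le \theta^0 \le e$. We efficiently solve this problem by a 
penalty/barrier multiplier method with logarithmic-quadratic penalty function 
as proposed in [1]. 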

4 Evaluation 

4.1 Ground Truth Experiments 

In this section, we consider artificial training sets in $\mathbb{R}^2$ and $\mathbb{R}^4$ where y is a 
function of the first two features $x_1$ and $x_2$. The examples in Fig. 1 show that our 
quadratic FSV approach indeed performs feature selection and finds classification 
rules for quadratic, not linearly separable problems. For the non-quadratic 
chess board classification problems in Fig. 2, our kernel-target alignment 
approach performs very well, in contrast to all other feature selection approaches 
presented. Remarkably, the alignment functional incorporates implicit feature 
selection for $\lambda = 0$. In both cases, only relevant feature sets are selected as can 
be seen in the bottom plots. 

Panels: deterministic in $\mathbb{R}^2$ (left); random in $\mathbb{R}^4$, four normal random variables 
$x_1, \dots, x_4$ with variances 1, 1, 1 and 2 (right). 

Fig. 1. Quadratic classification problems with $y = \mathrm{sgn}(x_1^2 + x_2^2 - 1)$. Top: Training 
points and decision boundaries (white lines) computed by (4) for $\lambda = 0.1$, left: in $\mathbb{R}^2$, 
right: projection onto selected features. Bottom: Features determined by (4) 

Panels: deterministic, $x \in \{-3, -1, 1, 3\}^2$ (left); 4 random features, same as Fig. 1 right (right). 

Fig. 2. Chess board classification problems with class label determined by 
$(\lfloor x_1/2 \rfloor \bmod 2) \oplus (\lfloor x_2/2 \rfloor \bmod 2)$. 
Top: Training points and Gaussian SVM decision boundaries (white lines) for $\sigma = 1$, 
$\lambda = 0.1$, left: in $\mathbb{R}^2$, right: projection onto selected features. Bottom: Features 
determined by (6) 








218 J. Neumann, C. Schnorr, and G. Steidl 



Table 1. Statistics for data sets used 

data set     number of features d   number of samples n   class distribution n_{+1}/n_{-1} 
wpbc60       32                     110                   41/69 
wpbc24       32                     155                   28/127 
liver        6                      345                   145/200 
Cleveland    13                     297                   160/137 
ionosphere   34                     351                   225/126 
pima         8                      768                   500/268 
bcw          9                      683                   444/239 


4.2 Real-World Data 

To test all our methods on real-world data, we use several data sets from the UCI 
repository [3] resumed in Table 1. We rescaled the features linearly to zero mean 
and unit variance and compare our approaches with RLP and FSV favoured 
in [4]. 

Choice of parameters. We set a = 5 in (2) as proposed in [4] and a = ^ in 
(5) which maximises the problems’ alignment. We start the DCA with v° = 1 
for the f 2 -^o~SVM, FSV and quadratic FSV and with 0° = e/2 for the kernel- 
target alignment approach, respectively. We stop on v with tol = 10 -5 resp. 
tol = 10~ 3 for Q. We retain one half of each run’s cross-validation training set 
for parameter selection. The parameters are chosen to minimise the validation 
error from In fi £ {0, . . . , 10}, In v £ (—5, . . . , 5}, A £ {0.05, 0.1, 0.2, . . . , 0.9, 0.95} 
for (quadratic) FSV and A £ {0, 0.1, . . . , 0.9} for the kernel-target alignment 
approach. In case of equal validation error, we choose the larger values for (v, /t) 
resp. A. In the same manner, the SVM weight parameter A is chosen according 
to the smallest 1 / A £ {e -5 , e -4 , . . . , e 5 } independently of the selected features. 

The results are summarised in Table 2 where the number of features is determined
as $|\{j = 1, \dots, d : |v_j| > 10^{-8}\}|$ resp. $|\{j = 1, \dots, d : |\theta_j| > 10^{-2}\}|$. It is
clear that all proposed approaches perform feature selection: linear FSV discards
most features followed by the kernel-target alignment approach and then the
$\ell_2$-$\ell_0$-SVM, then the $\ell_2$-$\ell_1$-SVM. In addition, for all approaches the test error is
often smaller than for RLP. The quadratic FSV performs well mainly for special
problems (e.g., 'liver' and 'ionosphere'), but the classification is good in general
for all other approaches.
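Counting the selected features amounts to thresholding the magnitude of the learned weights. The sketch below uses the two thresholds quoted in the text (10^-8 and 10^-2); the function name and the weight vector are hypothetical illustrations.

```python
def count_selected(weights, threshold):
    """Number of features whose weight magnitude exceeds the threshold."""
    return sum(1 for w in weights if abs(w) > threshold)

# Hypothetical weight vector of a trained classifier with d = 5 features.
v = [0.7, 0.0, -3e-9, 1.2e-4, -0.05]

n_fine = count_selected(v, 1e-8)     # threshold quoted for the linear methods
n_coarse = count_selected(v, 1e-2)   # threshold quoted for the second criterion
```

The choice of threshold matters: numerically tiny weights such as -3e-9 are treated as discarded features rather than as selected ones.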

Table 2. Feature selection and classification tenfold cross-validation performance
(average number of features, average test error [%]); bold numbers indicate lowest
errors

             |               linear classification                   |  nonlinear classification
             |    RLP     |    FSV     | l2-l1-SVM  | l2-l0-SVM  | quad. FSV  | k.-t. align.
data set     | dim.  err  | dim.  err  | dim.  err  | dim.  err  | dim.  err  | dim.  err
wpbc60       | 32.0  40.9 |  0.4  36.4 | 12.4  35.5 | 13.4  37.3 |  3.2  37.3 |  3.9  35.5
wpbc24       | 32.0  27.7 |  0.0  18.1 | 12.6  17.4 |  2.9  18.1 |  0.0  18.1 |  1.9  18.1
liver        |  6.0  31.9 |  2.1  36.2 |  6.0  35.1 |  5.0  34.2 |  3.2  32.5 |  2.5  35.4
Cleveland    | 13.0  16.2 |  1.8  23.2 |  9.9  16.5 |  8.2  16.5 |  9.2  30.3 |  3.2  23.6
ionosphere   | 33.0  13.4 |  2.3  21.7 | 24.8  13.4 | 14.0  15.7 | 32.9  10.8 |  6.6   7.7
pima         |  8.0  22.5 |  0.7  28.9 |  6.6  25.1 |  6.1  24.7 |  4.7  29.9 |  1.6  25.7
bcw          |  9.0   3.4 |  2.4   4.8 |  8.7   3.2 |  7.9   3.1 |  5.4   9.4 |  2.8   4.2

5 Summary and Conclusion

We proposed several novel methods that extend existing linear embedded feature 
selection approaches towards better generalisation ability by improved regular- 
isation and constructed feature selection methods in connection with nonlinear 
classifiers. In order to apply the DCA, we found appropriate splittings of our 
non-convex objective functions. In the experiments with real data, effective fea- 
ture selection was always carried out in conjunction with a small classification 
error. So direct objective minimisation feature selection is profitable and viable 
for different types of classifiers. In higher dimensions, the curse of dimensional- 
ity affects the classification error even more such that our methods will become 
more important here. A further evaluation of high-dimensional problems as well
as the incorporation of other feature maps is future work. 
Abstract. During recent years much effort has been spent in incorporating
problem specific a-priori knowledge into kernel methods for machine learning. A
common example is a-priori knowledge given by a distance measure between objects.
A simple but effective approach for kernel construction consists of substituting the
Euclidean distance in ordinary kernel functions by the problem specific distance
measure. We formalize this distance substitution procedure and investigate its
theoretical and empirical effects. In particular we state criteria for definiteness of the
resulting kernels. We demonstrate the wide applicability by solving several
classification tasks with SVMs. Regularization of the kernel matrices can additionally
increase the recognition accuracy.



1 Introduction 

In machine learning so-called kernel methods have become the state of the art for a
variety of different problem types like regression, classification, clustering, etc. [14].
The main ingredient in these methods is the problem specific choice of a kernel function.
This choice should ideally incorporate as much a-priori knowledge as possible. One
example is the incorporation of knowledge about pairwise proximities. In this setting,
the objects are not given explicitly but only implicitly by a distance measure.

This paper focusses on the incorporation of such distance measures in kernel
functions and investigates the application in support vector machines (SVMs) as the most
widespread kernel method. Up to now mainly three approaches have been proposed for
using distance data in SVMs. One approach consists of representing each training object
as a vector of its distances to all training objects and using standard SVMs on this data
[5,12]. The second method is embedding the distance data in a vector space, regularizing
the possibly indefinite space and performing ordinary linear SVM classification
[5,12]. These approaches have the disadvantage of losing the sparsity in the sense that all
training objects have to be retained for classification. This makes them inconvenient for
large scale data.

The third method circumvents this problem by using the Gaussian rbf-kernel and 
plugging in problem specific distance measures [1,4,8,11]. The aim of this paper is to 
formalize and extend this approach to more kernel types including polynomial kernels. 
The paper is structured as follows: We formalize distance substitution in the next
section. Statements on theoretical properties of the kernels follow in Section 3 and
comments on consequences for use in SVMs are given in Section 4. In Section 5 we
continue with SVM experiments by distance substitution and investigate regularization
methods for the resulting kernel matrices. We conclude with Section 6.

2 Distance Substitution Kernels 

The term kernel refers to a real valued symmetric function $k(x, x')$ of objects $x$ in a set $X$.
A kernel function is positive definite (pd), if for any $n$, any objects $x_1, \dots, x_n \in X$ and
any vector $c \in \mathbb{R}^n$ the induced kernel matrix $K := (k(x_i, x_j))_{i,j=1}^n$ satisfies $c^T K c \ge 0$.
The larger set of conditionally positive definite (cpd) kernels consists of those which
satisfy this inequality for all $c$ with $c^T \mathbf{1} = 0$. These pd/cpd kernel functions have received much
attention as they have nice properties, in particular they can be interpreted as or related to
inner products in Hilbert spaces.
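The difference between pd and cpd can be made concrete with a two-point example. The sketch below evaluates the quadratic form c^T K c for the negative squared distance matrix of the points 0 and 1 on the real line; the helper name `quadratic_form` and the example values are illustrative assumptions.

```python
def quadratic_form(K, c):
    """Return c^T K c for a square matrix K (list of rows) and a vector c."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

# Negative squared distances between the points 0 and 1 on the real line.
K = [[0.0, -1.0],
     [-1.0, 0.0]]

q_balanced = quadratic_form(K, [1.0, -1.0])   # coefficients sum to zero
q_general = quadratic_form(K, [1.0, 1.0])     # coefficients do not sum to zero
```

Here `q_balanced` is nonnegative while `q_general` is negative, so this matrix violates the pd condition but is compatible with the weaker cpd condition, which only restricts coefficient vectors that sum to zero.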

In distance based learning the data samples $x$ are not given explicitly but only by a
distance function $d(x, x')$. We do not impose strict assumptions on this distance measure,
but require it to be symmetric, have zero diagonal, i.e. $d(x, x) = 0$, and be nonnegative. If
a given distance measure does not satisfy these requirements, it can easily be symmetrized
by $\bar d(x, x') := \frac{1}{2}(d(x, x') + d(x', x))$, given zero diagonal by
$\bar d(x, x') := d(x, x') - \frac{1}{2}(d(x, x) + d(x', x'))$ or made positive by
$\bar d(x, x') := |d(x, x')|$. We call such a distance
isometric to an $L_2$-norm if the data can be embedded in a Hilbert space $\mathcal{H}$ by $\Phi : X \to \mathcal{H}$
such that $d(x, x') = \|\Phi(x) - \Phi(x')\|$. After choice of an origin $O \in X$ every distance
$d$ induces a function

$\langle x, x' \rangle_d^O := -\frac{1}{2}\left(d(x, x')^2 - d(x, O)^2 - d(x', O)^2\right)$.   (1)

This notation reflects the idea that in case of $d$ being the $L_2$-norm in a Hilbert space $X$,
$\langle x, x' \rangle_d^O$ corresponds to the inner product in this space with respect to the origin $O$.
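The repair steps for a raw dissimilarity matrix and the induced inner product of Eq. (1) can be sketched in a few lines. The function names are hypothetical, and chaining the three repairs in this particular order is an assumption; the text only lists them as separate options.

```python
def repair_distance(D):
    """Symmetrise, zero the diagonal, and take absolute values of a matrix D."""
    n = len(D)
    S = [[0.5 * (D[i][j] + D[j][i]) for j in range(n)] for i in range(n)]
    Z = [[S[i][j] - 0.5 * (S[i][i] + S[j][j]) for j in range(n)] for i in range(n)]
    return [[abs(Z[i][j]) for j in range(n)] for i in range(n)]

def inner_product(D, i, j, origin):
    """Eq. (1): -1/2 (d(x,x')^2 - d(x,O)^2 - d(x',O)^2), O = sample `origin`."""
    return -0.5 * (D[i][j] ** 2 - D[i][origin] ** 2 - D[j][origin] ** 2)

# Euclidean distances between the points 0, 3 and 5 on the real line.
D = [[0.0, 3.0, 5.0],
     [3.0, 0.0, 2.0],
     [5.0, 2.0, 0.0]]
ip = inner_product(D, 1, 2, origin=0)

# A hypothetical asymmetric matrix with nonzero diagonal, repaired.
R = repair_distance([[1.0, 2.0], [4.0, 1.0]])
```

With the first point as origin, `ip` reproduces the ordinary product of the coordinates 3 and 5, which is what Eq. (1) promises for a Euclidean distance.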

For any kernel $k(\|x - x'\|)$ and distance measure $d$ we call $k_d(x, x') := k(d(x, x'))$
its distance substitution kernel (DS-kernel). Similarly, for a kernel $k(\langle x, x' \rangle)$ we call
$k_d(x, x') := k(\langle x, x' \rangle_d^O)$ its DS-kernel. This notion is reasonable as in terms of (1)
indeed distances are substituted. In particular for the simple linear, negative-distance,
polynomial, and Gaussian kernels, we denote their DS-kernels by

$k_d^{lin}(x, x') := \langle x, x' \rangle_d^O$,   $k_d^{nd}(x, x') := -d(x, x')^\beta$, $\beta \in [0, 2]$,   (2)

$k_d^{pol}(x, x') := (1 + \gamma \langle x, x' \rangle_d^O)^p$,   $k_d^{rbf}(x, x') := e^{-\gamma d(x, x')^2}$, $p \in \mathbb{N}$, $\gamma \in \mathbb{R}_+$.

Of course, more general distance- or dot-product based kernels exist and corresponding
DS-kernels can be defined, e.g. sigmoid, multiquadric, $B_n$-spline [14], etc.
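Given only a distance matrix, the four simple DS-kernel matrices can be computed directly. The sketch below assumes the kernel forms: linear as the inner product of Eq. (1), negative distance as -d^beta, polynomial as (1 + gamma * inner product)^p, and Gaussian as exp(-gamma * d^2). The function names and the dictionary layout are hypothetical.

```python
import math

def ip(D, i, j, o):
    # Inner product of Eq. (1) with the sample at index o as origin.
    return -0.5 * (D[i][j] ** 2 - D[i][o] ** 2 - D[j][o] ** 2)

def ds_kernels(D, origin=0, beta=2.0, gamma=1.0, p=2):
    """Return the four simple DS-kernel matrices as a dict of nested lists."""
    idx = range(len(D))
    return {
        "lin": [[ip(D, i, j, origin) for j in idx] for i in idx],
        "nd":  [[-(D[i][j] ** beta) for j in idx] for i in idx],
        "pol": [[(1.0 + gamma * ip(D, i, j, origin)) ** p for j in idx] for i in idx],
        "rbf": [[math.exp(-gamma * D[i][j] ** 2) for j in idx] for i in idx],
    }

# Two objects at distance 1; the first one serves as origin.
D = [[0.0, 1.0], [1.0, 0.0]]
K = ds_kernels(D, origin=0, beta=2.0, gamma=0.5, p=2)
```

Only the distance matrix enters the computation, which is the point of distance substitution: no explicit vector representation of the objects is needed.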

3 Definiteness of DS-Kernels 

The most interesting question posed on new kernels is whether they are (c)pd. In fact, for 
DS-kernels given by (2) the definiteness can be summed up quite easily. The necessary 
tools and references can be found in [14]. 
Proposition 1 (Definiteness of Simple DS-Kernels). The following statements are
equivalent for a (nonnegative, symmetric, zero-diagonal) distance $d$:

i) $d$ is isometric to an $L_2$-norm
ii) $k_d^{nd}$ is cpd for all $\beta \in [0, 2]$
iii) $k_d^{lin}$ is pd
iv) $k_d^{rbf}$ is pd for all $\gamma \in \mathbb{R}_+$
v) $k_d^{pol}$ is pd for all $p \in \mathbb{N}$, $\gamma \in \mathbb{R}_+$.

Proof. i) implies ii): [14, Prop. 2.22] covers the case $\beta = 2$ and [14, Prop. 2.23] settles
the statement for arbitrary $\beta \in [0, 2]$. The reverse implication ii) $\Rightarrow$ i) follows by [14,
Prop. 2.24]. Equivalence of ii) and iii) also is a consequence of [14, Prop. 2.22]. [14,
Prop. 2.28] implies the equivalence of ii) and iv). Statement v) follows from iii) as
the set of pd functions is closed under products and linear combinations with positive
coefficients. The reverse can be obtained from the pd functions $\frac{1}{\gamma} k_d^{pol}$. With $p = 1$ and
$\gamma \to \infty$ these functions converge to $\langle x, x' \rangle_d^O$. Hence the latter also is pd.



Further statements for definiteness of more general dot-product or distance-based 
kernels are possible, e.g. by Taylor series argumentation. 

For some distance measures, the relation to an $L_2$-norm is apparent. An example is
the Hellinger distance $H(p, p')$ between probability distributions which is defined by
$(H(p, p'))^2 := \int (\sqrt{p} - \sqrt{p'})^2 \, dx$. However, the class of distances which are isometric
to $L_2$-norms is much wider than the obvious forms $d = \|x - x'\|$. For instance, [2]
proves very nicely that $k^{rbf}_{\sqrt{\chi^2}}$ is pd, where

$\chi^2(x, y) := \sum_i \frac{(x_i - y_i)^2}{x_i + y_i}$

denotes the $\chi^2$-distance between histograms. Thus, according to Proposition 1, $\sqrt{\chi^2}$
is isometric to an $L_2$-norm. Only looking at the $\chi^2$-distance, the corresponding Hilbert
space is not apparent. In summary we can conclude that not only $k_H^{lin}$ (Bhattacharyya's
affinity) and $k^{rbf}_{\sqrt{\chi^2}}$, but all DS-kernels given by (2) are pd/cpd when using $\sqrt{\chi^2}$ or $H$.
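For the Hellinger distance the isometry to an L2-norm can be checked numerically. The sketch below assumes discrete distributions, so that the integral becomes a sum, and uses the component-wise square root as explicit embedding; the function names and the two example distributions are hypothetical.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (lists summing to 1)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def embed(p):
    """Explicit embedding: component-wise square root."""
    return [math.sqrt(a) for a in p]

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]

h = hellinger(p, q)
# Euclidean distance between the embedded points; agrees with h up to rounding.
e = math.dist(embed(p), embed(q))
```

For such a distance the embedding is visible from the definition. For other distances, such as the square root of the chi-square distance mentioned in the text, the isometry holds although no such simple map is apparent.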

In practice however, problem specific distance measures often lead to DS-kernels
which are not pd. A criterion for disproving pd-ness is the following corollary, which is
a simple consequence of Proposition 1 as $L_2$-norms are in particular metrics. It allows one to
conclude missing pd-ness of DS-kernels that arise from distances which are non-metric,
e.g. violate the triangle inequality. It can immediately be applied to kernels based on
tangent-distance [8], dynamic-time-warping (DTW) distance [1] or Kullback-Leibler
(KL) divergence [11].

Corollary 1 (Non-Metricity Prevents Definiteness). If $d$ is not metric then the resulting
DS-kernel $k_d^{nd}$ is not cpd and $k_d^{lin}$, $k_d^{rbf}$, $k_d^{pol}$ are not pd.

Note that for certain values $\beta$, $\gamma$, $p$, the resulting DS-kernels are possibly (c)pd. Recall
also that the converse of the corollary is not true. In particular, the $L_p$-metrics for $p \neq 2$
can be shown to produce non-pd DS-kernels.
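In practice the corollary suggests a cheap test on a given distance matrix: a single violated triangle inequality already certifies non-metricity. A minimal sketch of such a check follows; the function name, the tolerance, and the example matrix are illustrative assumptions.

```python
from itertools import permutations

def triangle_violations(D, tol=1e-12):
    """Return index triples (i, j, k) with d(i,k) > d(i,j) + d(j,k)."""
    n = len(D)
    return [(i, j, k) for i, j, k in permutations(range(n), 3)
            if D[i][k] > D[i][j] + D[j][k] + tol]

# Hypothetical dissimilarities: the detour 0 -> 1 -> 2 is shorter than 0 -> 2.
D = [[0.0, 1.0, 5.0],
     [1.0, 0.0, 1.0],
     [5.0, 1.0, 0.0]]
bad = triangle_violations(D)
```

An empty result does not prove definiteness, since the converse of the corollary does not hold; the check can only rule definiteness out.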
4 SVMs with Indefinite Kernels

In the following we apply DS-kernels on learning problems. For this we focus on the 
very successful SVM for classification. This method can traditionally be applied if the 
kernel functions are pd or cpd. If a given distance produces DS-kernels which are pd, 
these can be applied in SVMs as usual. But also in the non-pd case they can be useful, 
as non-cpd kernels have shown convincing results in SVMs [1,4,8,11]. This empirical 
success is additionally supported by several theoretical statements: 

1. Feature space: Indefinite kernels can be interpreted as inner products in indefinite
vector spaces, enabling geometric argumentation [10].

2. Optimal hyperplane classifier: SVMs with indefinite kernels can be interpreted as
optimal hyperplane classifiers in these indefinite spaces [7].

3. Numerics: Convergence of SVM implementations to a (possibly local) stationary
point can be guaranteed [9].

4. Uniqueness: Even with extreme non-cpd kernel matrices unique solutions are
possible [7].

5 Experiments 

We performed experiments using various distance measures. Most of them were used in 
literature before. We do not explicitly state the definitions but refer to the corresponding 
publications. For each distance measure we used several labeled datasets or several 
labelings of a single dataset. 

The dataset kimia (2 sets, each 72 samples, 6 classes) is based on binary images of
shapes. The dissimilarity is measured by the modified Hausdorff distance. Details and
results from other classification methods can be found in [12]. We applied a
multiclass-SVM. The dataset proteins (226 samples) consists of evolutionary distances between
amino acid sequences of proteins [6]. We used 4 different binary labelings corresponding
to one-versus-rest problems. The dataset cat-cortex (65 samples) is based on a matrix of
connectivity strengths between cortical areas of a cat. Other experiments with this data
have been presented in [5,6]. Here we symmetrized the similarity matrix and produced
a zero diagonal distance matrix. Again we used 4 binary labelings corresponding to
one-versus-rest classification problems. The datasets music-EMD and music-PTD are
based on sets of 50 and 57 music pieces represented as weighted point sets. The
earth-mover's distance (EMD) and the proportional transportation distance (PTD) were chosen
as distance measures, see [16]. As class labels we used the corresponding composers
resulting in 2 binary classification problems per distance measure. The dataset
USPS-TD (4 sets, 250 samples per set, 2 classes) uses a fraction of the well known USPS
handwritten digits data. As distance measure we use the two-sided tangent distance
[15], which incorporates certain problem specific transformation knowledge. The set
UNIPEN-DTW (2 sets, 250 samples per set, 5 classes) is based on a fraction of the
huge UNIPEN online handwriting sequence dataset. Dissimilarities were defined by the
DTW-distance [1]; again we applied a multiclass-SVM.

These different datasets represent a wide spectrum from easily separable to hardly separable
data. None of these distances are isometric to an $L_2$-norm. The restricted number of
samples is a consequence of the small size of the original datasets or due to the fact
that the regularization experiments presented in Section 5.2 are only feasible for reasonably
sized datasets.



5.1 Pure Distance Substitution 

In this section we present results with pure distance substitution and compare them 
with the 1 -nearest-neighbour and best k-nearest-neighbour classifier. These are natural 
classifiers when dealing with distance data. 
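The 1-nearest-neighbour baseline needs nothing but the distance matrix and the labels. The following sketch computes its leave-one-out error; the function name, the tie-breaking by lowest index, and the toy data are assumptions made for illustration.

```python
def loo_error_1nn(D, labels):
    """Leave-one-out error rate of the 1-nearest-neighbour rule on distances D."""
    n = len(D)
    errors = 0
    for i in range(n):
        # Nearest other sample; ties are broken by the lowest index (assumption).
        nearest = min((j for j in range(n) if j != i), key=lambda j: D[i][j])
        errors += labels[nearest] != labels[i]
    return errors / n

# Hypothetical distances between four objects forming two tight pairs.
D = [[0, 1, 4, 5],
     [1, 0, 3, 4],
     [4, 3, 0, 1],
     [5, 4, 1, 0]]
err_clustered = loo_error_1nn(D, ["a", "a", "b", "b"])
err_mixed = loo_error_1nn(D, ["a", "b", "a", "b"])
```

When the labels follow the pair structure the error is zero, and when they cut across it every sample is misclassified, which mirrors the reading of the nearest-neighbour errors as an indicator of how well the labels cluster.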

We computed the leave-one-out (LOO) error of an SVM while logarithmically varying
the parameter $C$ along a line, respectively $C$, $\gamma$ in a suitable grid. For the $k_d^{pol}$ kernel
a fixed polynomial degree $p$ among $\{2, 4, 6, 8\}$ was chosen after simple initial
experiments. The origin $O$ was chosen to be the point with minimum squared distance sum
to the other training objects. As $k_d^{nd}$ with $\beta = 2$ and $k_d^{lin}$ are equivalent in SVMs
(which follows by plugging (1) in the SVM optimization problem and making use of the
equality constraint), we confine ourselves to using the former. We report the best
LOO-error for all datasets in Table 1. Note that these errors might be biased compared to the
true generalization error, as we did not use training/validation partitions for parameter
optimization.
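The choice of the origin as the point with minimum squared distance sum to the other training objects is a one-line search over the distance matrix. The sketch below illustrates it; the function name and the example matrix are hypothetical.

```python
def choose_origin(D):
    """Index of the sample with minimum sum of squared distances to all others."""
    n = len(D)
    return min(range(n),
               key=lambda i: sum(D[i][j] ** 2 for j in range(n) if j != i))

# Three points on a line at positions 0, 2 and 4: the middle one is most central.
D = [[0.0, 2.0, 4.0],
     [2.0, 0.0, 2.0],
     [4.0, 2.0, 0.0]]
origin = choose_origin(D)
```

This picks a medoid-like object, so the origin is always one of the training objects and no vector representation is required.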



Table 1. Base LOO-errors [%] of classification experiments

dataset         k_d^nd   k_d^pol   k_d^rbf    1-nn    k-nn
kimia-1          15.28     11.11      4.17    5.56    5.56
kimia-2          12.50      9.72      9.72   12.50   12.50
proteins-H-α      0.89      0.89      0.89    1.33    1.33
proteins-H-β      3.54      2.21      2.65    3.54    3.54
proteins-M        0.00      0.00      0.00    0.00    0.00
proteins-GH       0.00      0.44      0.00    1.77    1.77
cat-cortex-V      3.08      1.54      0.00    3.08    3.08
cat-cortex-A      6.15      3.08      4.62    6.15    6.15
cat-cortex-S      6.15      3.08      3.08    6.15    3.08
cat-cortex-F      7.69      6.15      4.62    4.62    3.08
music-EMD-1      40.00     22.00     20.00   42.00   42.00
music-EMD-2      42.11     43.86     10.53   21.05   21.05
music-PTD-1      34.00     30.00     32.00   46.00   34.00
music-PTD-2      31.58     33.33     28.07   38.60   38.60
USPS-TD-1        10.40      5.20      3.20    3.60    3.60
USPS-TD-2        14.40      7.60      2.40    3.20    3.20
USPS-TD-3        12.80      6.80      4.00    5.20    5.20
USPS-TD-4        10.80      6.40      3.20    4.40    4.00
UNIPEN-DTW-1     14.40      6.00      5.20    5.60    5.60
UNIPEN-DTW-2     10.80      7.60      6.00    7.20    6.40



The identical low errors of the 1-nn and k-nn in the datasets kimia, proteins,
cat-cortex, USPS-TD, and UNIPEN-DTW demonstrate that the data clusters well with the
given labeling. For the music data sets the labels obviously do not define proper clusters.

As SVMs with kernels $k_d^{nd}$, $\beta = 2$ can be interpreted as linear classifiers [7], the good
performance of these on proteins and cat-cortex data is a hint on their linear separability.
Simultaneously, the sets with higher error indicate that a nonlinear classifier in the
dissimilarity space has to be applied. Indeed, the polynomial and Gaussian DS-kernel
improve the results of the linear kernel for most datasets. The Gaussian DS-kernel even
slightly outperforms the polynomial in most cases.

Compared to the nearest neighbour results, the nonlinear distance substitutions
compare very favorably. The polynomial kernel can compete with or outperform the best
k-nn for the majority of datasets. The Gaussian DS-kernel competes with or outperforms
the best k-nn for all but one dataset.

For the last two distance measures, large scale experiments with certain distance 
substitution kernels have already been successfully presented in [1,8]. In this respect, 
scalability of the results to large datasets is expected. To summarize, the experiments 
demonstrate the effectiveness of distance substitution kernels despite producing indef- 
inite kernel matrices. The result is a sparse representation of the solution by training 
examples, that is, only a small subset of training objects has to be retained. Thus, it is 
particularly suited for large training sets. 

5.2 Regularization of Kernel Matrices 

In this section we investigate different regularization methods to eliminate the negative 
eigenvalues of the kernel matrices. Similar regularizations have been performed in liter- 
ature, e.g. regularizing linear SVMs [5,12] or embedding of non-metric data [13]. The 
method denoted off-diagonal addition (ODA) simply adds a suitable constant on the off- 
diagonal elements of the squared distance matrix, which results in a Euclidean distance 
matrix and therefore can be used for distance substitution resulting in pd kernels. Two 
other methods center the kernel matrix [12] and perform an eigenvalue decomposition. 
The approach (CNE) cuts off contributions corresponding to negative eigenvalues and 
(RNE) reflects the negative eigenvalues by taking their absolute values. 

These operations particularly imply that the same operations have to be performed 
for the testing data. If the testing data is known beforehand, this can be used during 
training for computing and regularizing the kernel matrix. Note that this is not training 
on the testing data, as only the data points but not the labels are used for the kernel 
computations. Such training is commonly called transductive learning. If a test sample 
is not known at the training stage, the vector of kernel evaluations has to undergo the 
same regularization transformation as the kernel matrix before. Hence the diagonalizing 
vectors and eigenvalues have to be maintained and involved in this remapping of each 
testing vector. Both methods have the consequence that the computational complexity is 
increased during training and testing and the sparsity is lost, i.e. the solution depends on 
all training instances. So these regularization methods only apply, where computational 
demands are not so strict and sparsity is not necessary. For the experiments we used the 
transductive approach for determining the LOO errors. 

If one can do without sparsity, another simple method is used for comparisons: 
Representing each training instance by a vector of squared distances to all training points 
makes a simple linear or Gaussian SVM applicable. We denoted these approaches as 
lin-SVM resp. rbf-SVM in Table 2, which lists the classification results. 

The experiments demonstrate that regularization of kernel matrices can remarkably 
improve recognition accuracies and compete with or outperform SVMs on distance- 
vectors. The ODA regularization can increase accuracies, but it is clearly outperformed 
by the CNE and RNE methods which maintain or increase accuracy in 52 resp. 50 of 
the 60 experiments. Regularization seems to be advantageous for linear and polynomial 
kernels. For the Gaussian DS-kernels only few improvements can be observed. A com- 
parison to the last columns indicates that the (non-ODA) regularized classifiers can 
compete with the linear SVM. The latter however is clearly inferior to the regularized 
nonlinear DS-kernels $k_d^{pol}$ and $k_d^{rbf}$. In comparison to the rbf-SVM the $k_d^{nd}$ experiments
cannot compete. The $k_d^{pol}$-CNE experiments also perform worse than the rbf-SVM in
12 cases. But the $k_d^{pol}$-RNE, $k_d^{rbf}$-CNE resp. $k_d^{rbf}$-RNE settings obtain identical or better
results than the rbf-SVM in the majority of classification problems.

Table 2. LOO-errors [%] of classification experiments with regularized kernel matrices

               |       k_d^nd        |   k_d^pol   |   k_d^rbf   |
dataset        |  ODA    CNE    RNE  |  CNE    RNE |  CNE    RNE | lin-SVM  rbf-SVM
kimia-1        | 13.89   8.33   4.17 |  8.33  4.17 |  4.17  4.17 |   8.33     6.94
kimia-2        | 16.67   9.72   8.33 |  9.72  8.33 |  9.72  8.33 |   8.33     8.33
proteins-H-α   |  0.44   0.89   0.89 |  0.89  0.89 |  0.89  0.89 |   1.33     0.44
proteins-H-β   |  3.10   3.54   3.98 |  2.21  2.21 |  2.65  2.65 |   5.75     2.65
proteins-M     |  0.00   0.00   0.00 |  0.00  0.00 |  0.00  0.00 |   0.00     0.00
proteins-GH    |  0.00   0.00   0.00 |  0.44  0.44 |  0.00  0.00 |   0.00     0.00
cat-cortex-V   |  6.15   3.08   3.08 |  3.08  3.08 |  1.54  3.08 |   4.62     3.08
cat-cortex-A   |  6.15   4.62   6.15 |  4.62  6.15 |  4.62  6.15 |   1.54     1.54
cat-cortex-S   |  6.15   3.08   4.62 |  3.08  3.08 |  3.08  3.08 |   3.08     3.08
cat-cortex-F   |  6.15   4.62   4.62 |  4.62  4.62 |  4.62  4.62 |   1.54     1.54
music-EMD-1    | 44.00  38.00  40.00 | 30.00 40.00 | 30.00 30.00 |  44.00    20.00
music-EMD-2    | 42.11  15.79  21.05 | 12.28 12.28 | 14.04 10.53 |  21.05    15.79
music-PTD-1    | 38.00  44.00  40.00 | 40.00 38.00 | 32.00 32.00 |  40.00    28.00
music-PTD-2    | 47.37  29.82  38.60 | 26.32 22.81 | 28.07 17.54 |  29.82    21.05
USPS-TD-1      |  9.60   4.00   6.00 |  4.80  4.00 |  3.20  3.20 |   6.80     4.80
USPS-TD-2      | 14.40   9.60   7.20 |  5.60  4.00 |  2.40  2.40 |   6.00     4.40
USPS-TD-3      | 12.00   6.80   8.00 |  4.00  4.40 |  4.00  4.40 |   6.80     4.40
USPS-TD-4      | 11.20   7.60   6.40 |  5.20  5.60 |  3.20  3.20 |   7.20     1.60
UNIPEN-DTW-1   | 13.20   8.40   8.40 |  5.20  5.60 |  4.40  4.80 |   8.00     6.80
UNIPEN-DTW-2   | 11.20   7.60   9.60 |  6.80  6.40 |  6.00  5.60 |   9.60     8.40

6 Conclusion and Perspectives 

We have characterized a class of kernels by formalizing distance substitution. This has
so far been performed for the Gaussian kernel. By the equivalence of inner product
and $L_2$-norm after fixing an origin, distances can also be used in inner-product kernels
like the linear or polynomial kernel. We have given conditions for proving/disproving
(c)pd-ness of the resulting kernels. We have concluded that DS-kernels involving e.g.
the $\chi^2$-distance are (c)pd, and others, e.g. resulting from KL-divergence, are not.

We have investigated the applicability of the DS-kernels by solving various
SVM-classification problems with different data sets and different distance measures, which
are not isometric to $L_2$-norms. The conclusion of the experiments was that good
classification is possible despite indefinite kernel matrices. Disadvantages of other methods are
circumvented, e.g. test-data involved in training, approximate embeddings, non-sparse
solutions or explicit working in feature space. This indicates that distance substitution
kernels in particular are promising for large datasets. In particular the Gaussian and
polynomial DS-kernels are good choices for general datasets due to their nonlinearity.
If sparsity of the solution is not necessary and computational demands during
classification are not so strict, then regularizations of the kernel matrices and the test-kernel
evaluations can be recommended. It has been shown that this procedure can substantially
improve recognition accuracy for e.g. the linear and polynomial DS-kernels.

Perspectives are to apply distance substitution on further types of kernels, further 
distance measures and in other kernel methods. This would in particular support recent 
promising efforts to establish non-cpd kernels for machine learning [3]. 
Abstract. In this paper, different well-known features for image retrieval
are quantitatively compared and their correlation is analyzed. We
compare the features for two different image retrieval tasks (color
photographs and medical radiographs) and a clear difference in performance
is observed, which can be used as a basis for an appropriate choice of
features. In the past a systematic analysis of image retrieval systems or
features was often difficult because different studies usually used different
data sets and no common performance measures were established.



1 Introduction 



For content-based image retrieval (CBIR), i.e. searching in image databases 
based on image content, several image retrieval systems have been developed. 
One of the first systems was the QBIC system [4]. Other popular research
systems are Blobworld [1], VIPER/GIFT [16], SIMBA [15], and SIMPLIcity [18].

All these systems compare images based on specific features in one way or 
another and therefore a large variety of features for image retrieval exists. Usu- 
ally, CBIR systems do not use all known features as this would involve large 
amounts of data and increase the necessary computing time. Instead, a set of 
features appropriate to the given task is usually selected, but it is difficult to
judge beforehand which features are appropriate for which tasks. The difficulty 
to assess the performance of a feature described in a publication is increased 
further by the fact that often the systems are evaluated on different datasets 
and few if any quantitative results are reported. 

In this work, a short overview of common features used for image retrieval 
is given and the correlation of different features for different tasks is analyzed. 
Furthermore, quantitative results for two databases representing different image 
retrieval tasks are given to compare the performance of the features. To our 
knowledge no such comparison exists yet, whereas [13] presents a quantitative 
comparison of different dissimilarity measures. 
In the system$^1$ used, images are represented by features and compared using
specific distance measures. These distances are combined in a weighted sum

$D(Q, X) := \sum_{m=1}^{M} w_m \cdot d_m(Q_m, X_m)$

where $Q$ is the query image, $X \in B$ is an image from the database $B$, $Q_m$ and $X_m$
are the $m$-th features of the images, respectively, $d_m$ is the corresponding distance
measure, and $w_m$ is a weighting coefficient. For each $d_m$, $\sum_{X \in B} d_m(Q_m, X_m) = 1$
is enforced by normalization. A set $R(Q)$ of $K$ database images is returned with

$R(Q) = \{X \in B : D(Q, X) \le D(Q, X') \;\forall X' \in B \setminus R(Q)\}$ with $|R(Q)| = K$
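The retrieval step described above can be sketched as follows: each feature's distances are normalised to sum to one over the database, combined in a weighted sum, and the K images with the smallest combined distance are returned. The function names, the data layout, and the example numbers are hypothetical and not taken from the system itself.

```python
def normalise(dists):
    """Scale the distances of one feature so that they sum to 1 over the database."""
    total = sum(dists)
    return [d / total for d in dists] if total > 0 else list(dists)

def retrieve(feature_dists, weights, K):
    """Return the indices of the K best database images and all combined distances.

    feature_dists[m][x] is the distance of feature m between the query and
    database image x.
    """
    normed = [normalise(d) for d in feature_dists]
    n_images = len(feature_dists[0])
    combined = [sum(w * nd[x] for w, nd in zip(weights, normed))
                for x in range(n_images)]
    ranking = sorted(range(n_images), key=lambda x: combined[x])
    return ranking[:K], combined

# Two hypothetical features, four database images.
dists = [[1.0, 3.0, 2.0, 4.0],     # feature 1
         [0.2, 0.1, 0.4, 0.3]]     # feature 2
top, scores = retrieve(dists, weights=[1.0, 1.0], K=2)
```

Setting all weights but one to zero reproduces the single-feature setting used for the comparison, and the normalisation keeps features with large raw distances from dominating the sum.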

Using only one feature at a time, this architecture allows us to compare the 
impact of different features on the retrieval results directly. The issue of choosing 
appropriate weightings of features is addressed in [2,3]. For the quantification of 
retrieval results two problems arise: 

1. Only very few datasets with hand-labelled relevances are available to com- 
pare different retrieval systems and these datasets are not commonly used. 
A set of 15 queries with manually determined relevant results is presented 
in [15], and experiments on these data can be used for a first comparison [2]. 
Nevertheless, due to the small number of images it is difficult to use these 
data for a thorough analysis. Therefore we use databases containing general 
images which are partitioned into separate classes of images. 

2. No standard performance measure is established in image retrieval. It has 
been proposed to adopt some of the performance measures used in textual 
information retrieval for image retrieval [8]. The precision-recall-graph is a 
common performance measure which can be summarized in one number by 
the area under the graph. In previous experiments it was observed that the 
error rate (ER) of the best match is strongly correlated (with a correlation 
coefficient of -0.93) to this area [3] and therefore we use the ER as retrieval 
performance measure in the following. This allows us to compare the results 
to published results of classification experiments on the same data. 
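For a database partitioned into classes, precision and recall along a ranked result list follow the standard information retrieval definitions. The sketch below computes them position by position; the function name and the example ranking are hypothetical, and the relation to the error rate noted afterwards is the usual one for a single query.

```python
def precision_recall(ranking, relevant):
    """Precision and recall after each position of a ranked result list."""
    relevant = set(relevant)
    hits = 0
    points = []
    for k, item in enumerate(ranking, start=1):
        hits += item in relevant
        points.append((hits / k, hits / len(relevant)))
    return points

# Hypothetical ranking of database image ids; images 3 and 4 share the query's class.
pr = precision_recall(ranking=[3, 7, 1, 4], relevant=[3, 4])
best_match_is_error = pr[0][0] == 0.0
```

The first entry of `pr` corresponds to the best match: for a single query the best match is an error exactly when the precision at rank 1 is zero, and averaging this indicator over all queries gives the error rate.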

2 Features for Image Retrieval 

In this section we present different types of features for image retrieval and the 
method of multidimensional scaling to visualize similarities between different 
features. We restrict the presentation to a brief overview of each feature and 
refer to references for further details. In this work, the goal is not to introduce 
new features but to give quantitative results for a comparison of existing features 
for image retrieval tasks. 

Table 1 gives an overview of the features and comparison measures used. 

1 http://www-i6.informatik.rwth-aachen.de/~deselaers/fire.html
Image Features. The most straightforward approach is to directly use the
pixel values as features. For example, the images might be scaled to a common 
size and compared using the Euclidean distance. In optical character recognition 
and for medical data improved methods based on image features usually obtain 
excellent results. In this work we use the Euclidean distance and the image 
distortion model (IDM) [6] to directly compare images. 

Color Histograms. Color histograms are widely used in image retrieval, e.g. [4]. 
It is one of the most basic approaches and to show performance improvement 
image retrieval systems are often compared to a system using only color his- 
tograms. Color histograms give an estimation of the distribution of the colors in 
the image. The color space is partitioned and for each partition the pixels within 
its range are counted, resulting in a representation of the relative frequencies of 
the occurring colors. In accordance with [13], we use the Jeffrey divergence to 
compare histograms. 
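A color histogram and its comparison by the Jeffrey divergence can be sketched as follows. The regular partition of the RGB cube, the number of bins, and the function names are illustrative assumptions; the divergence is written in its commonly used form, with each histogram compared to the bin-wise mean of both, which the text itself does not spell out.

```python
import math

def color_histogram(pixels, bins_per_channel=2):
    """Relative frequencies of RGB pixels (values 0..255) in a regular partition."""
    b = bins_per_channel
    hist = [0.0] * (b ** 3)
    for r, g, bl in pixels:
        idx = ((r * b // 256) * b + (g * b // 256)) * b + (bl * b // 256)
        hist[idx] += 1.0 / len(pixels)
    return hist

def jeffrey_divergence(h, g):
    """Jeffrey divergence; empty bins contribute nothing (0 * log 0 := 0)."""
    total = 0.0
    for a, c in zip(h, g):
        m = 0.5 * (a + c)
        if a > 0:
            total += a * math.log(a / m)
        if c > 0:
            total += c * math.log(c / m)
    return total

# Two hypothetical four-pixel images.
img_a = [(255, 0, 0), (250, 10, 10), (0, 0, 255), (0, 0, 250)]
img_b = [(255, 0, 0), (0, 0, 255), (0, 255, 0), (0, 250, 0)]
hist_a = color_histogram(img_a)
hist_b = color_histogram(img_b)
d = jeffrey_divergence(hist_a, hist_b)
```

Unlike the plain Kullback-Leibler divergence, this form is symmetric and stays finite when a bin is empty in one of the two histograms, which is common for sparse color histograms.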

Invariant Features. A feature is called invariant with respect to certain trans- 
formations, if it does not change when these transformations are applied to the 
image. The transformations considered here are mainly translation, rotation, 
and scaling. In this work, invariant feature histograms as presented in [15] are 
used. These features are based on the idea of constructing features invariant with 
respect to certain transformations by integration over all considered transforma- 
tions. The resulting histograms are compared using the Jeffrey divergence [13]. 

Invariant Fourier Mellin Features. It is well known that the amplitude spec- 
trum of the Fourier transformation is invariant against translation. Using this 
knowledge and log-polar coordinates it is possible to create a feature invariant 
with respect to rotation, scaling, and translation [14]. These features are com- 
pared using the Euclidean distance. 

Gabor Features. In texture analysis Gabor filters are frequently used [5]. In this 
work we apply the method presented in [10] where the HSV color space is used 
and hue and saturation are represented as one complex value. From these features 
we create histograms which are compared using the Jeffrey divergence [13]. 

Tamura Texture Features. In [17] the authors propose six texture features 
corresponding to human visual perception: coarseness, contrast, directionality,
line-likeness, regularity, and roughness. From experiments testing the significance 
of these features with respect to human perception, it was concluded that the 



Table 1. Features used, along with the associated distance measures.

Feature X_m                          Distance measure d_m
-----------------------------------  ----------------------------------------------
image features                       Euclidean distance, Image Distortion Model [6]
color histograms                     Jeffrey divergence [13]
invariant feature histograms         Jeffrey divergence [13]
Gabor feature histograms             Jeffrey divergence [13]
Tamura texture feature histograms    Jeffrey divergence [13]
local features                       direct transfer, LFIDM, Jeffrey divergence [2]
region based features                integrated region matching [18]

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first three features are very important. Thus in our experiments we use coarse- 
ness, contrast, and directionality to create a histogram describing the texture [2] 
and compare these histograms using the Jeffrey divergence [13]. In the QBIC 
system [4] histograms of these features are used as well. 

Local Features. Local features are small (square) sub-images extracted from 
the original image. It is known that local features can yield good results in 
various classification tasks [11]. Local features have some interesting properties 
for image recognition, e.g. they are inherently robust against translation. These 
properties are also interesting for image retrieval. To use local features for image 
retrieval, three different methods are available [2]: 

1. direct transfer: The local features are extracted from each database image
and from the query image. Then, the nearest neighbors for each of the local
features of the query are searched and the database images containing most of
these neighbors are returned.

2. local feature image distortion model (LFIDM): The local features from the
query image are compared to the local features of each image of the database
and the distances between them are summed up. The images with the lowest total
distances are returned.

3. histograms of local features: A reasonably large amount of local features
from the database is clustered and then each database image is represented by a
histogram of indices of these clusters. These histograms are then compared using
the Jeffrey divergence.
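As an illustration of the first method, direct transfer can be sketched as a voting scheme. This is a simplified sketch under stated assumptions: it uses a single nearest neighbor per query feature and a brute-force search, and the function and argument names are hypothetical rather than taken from [2].

```python
import numpy as np

def direct_transfer(query_feats, db_feats, db_image_ids, n_images):
    """Rank database images by nearest-neighbor votes of the query's local features.

    query_feats:  (Q, d) local features of the query image
    db_feats:     (P, d) local features pooled from all database images
    db_image_ids: (P,)   index of the database image each feature came from
    """
    query_feats = np.asarray(query_feats, dtype=float)
    db_feats = np.asarray(db_feats, dtype=float)
    db_image_ids = np.asarray(db_image_ids, dtype=int)
    # Squared Euclidean distance between every query and database feature.
    d2 = ((query_feats[:, None, :] - db_feats[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                  # nearest database feature per query feature
    votes = np.bincount(db_image_ids[nearest], minlength=n_images)
    ranking = np.argsort(-votes, kind="stable")  # images with the most votes first
    return ranking, votes
```

For realistic database sizes the brute-force distance matrix would be replaced by an approximate nearest-neighbor index; the voting logic stays the same.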

Region-based Features. Another approach to representing images is based 
on the idea to find image regions which roughly correspond to objects in the 
images. To achieve this objective the image is segmented into regions. The task 
of segmentation has been thoroughly studied [9] but most of the algorithms 
are limited to special tasks because image segmentation is closely connected 
to understanding arbitrary images, a yet unsolved problem. Nevertheless, some 
image retrieval systems successfully use image segmentation techniques [1,18]. 
We use the approach presented in [18] to compare region descriptions of images. 



2.1 Correlation of Features 



Since we have a large variety of features at our disposal, we may want to select 
an appropriate set of features for a given image retrieval task. Obviously, there 
are some correlations between different features. To detect these correlations, we 
propose to create a distance matrix for a database using all available features. 
Using a leaving-one-out approach, the distances between all pairs of images from 
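a database are determined for each available feature. For a database of $N$ images
with $M$ features this results in an $N' \times M$ distance matrix $D$ obtained from
$N' = N \cdot (N - 1)/2$ image pairs. From this matrix, the covariances $\Sigma_{mm'}$ and
correlations $R_{mm'}$ are determined as

$$\Sigma_{mm'} = \frac{1}{N'} \sum_{n=1}^{N'} D_{nm} D_{nm'} - \left( \frac{1}{N'} \sum_{n=1}^{N'} D_{nm} \right) \left( \frac{1}{N'} \sum_{n=1}^{N'} D_{nm'} \right), \qquad R_{mm'} = \frac{\Sigma_{mm'}}{\sqrt{\Sigma_{mm} \Sigma_{m'm'}}}$$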







T. Deselaers, D. Keysers, and H. Ney 




Fig. 1. Example images from the WANG database.



where $D_{nm}$ and $D_{nm'}$ denote the distances of the $n$-th image comparison using
the $m$-th and $m'$-th feature, respectively. The entries of the correlation matrix
$R$ are interpreted as similarities of different features. A high value $R_{mm'}$ denotes
a high similarity in the distances calculated based on the features $m$ and $m'$,
respectively. This similarity matrix $R$ is easily converted into a dissimilarity
matrix $W$ by setting $W_{mm'} := 1 - |R_{mm'}|$. This dissimilarity matrix $W$ is then
visualized using multi-dimensional scaling.
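The computation of the correlation and dissimilarity matrices from the distance matrix can be sketched as follows. The function name is illustrative; the input is assumed to hold one row per image pair and one column per feature, as in the definition of the distance matrix above.

```python
import numpy as np

def feature_dissimilarity(D):
    """Correlation matrix R and dissimilarity matrix W = 1 - |R| of the feature columns.

    D: (N', M) matrix, one row per image pair, one column per feature.
    """
    D = np.asarray(D, dtype=float)
    mean = D.mean(axis=0)
    cov = D.T @ D / D.shape[0] - np.outer(mean, mean)  # covariance of feature distances
    std = np.sqrt(np.diag(cov))
    R = cov / np.outer(std, std)                       # correlation
    W = 1.0 - np.abs(R)                                # dissimilarity
    return R, W
```

Two features whose distances are linearly related end up with a dissimilarity close to zero, regardless of the scale of their distance values.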

Multi-dimensional scaling seeks a representation of data points in a low di- 
mensional space while preserving the distances between data points as much as 
possible. Here, the data is presented in a two-dimensional space for visualization. 
A freely available MatLab library 2 was used for multi-dimensional scaling. 
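The text does not say which variant of multi-dimensional scaling the MatLab library implements. As a stand-in, the following sketch uses classical (Torgerson) scaling, which embeds the points via an eigendecomposition of the double-centered squared dissimilarities; it is an assumption for illustration, not the method used by the authors.

```python
import numpy as np

def classical_mds(W, dim=2):
    """Embed n points in `dim` dimensions so that Euclidean distances approximate W.

    W: (n, n) symmetric dissimilarity matrix with zero diagonal.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (W ** 2) @ J                # double-centered squared dissimilarities
    eigval, eigvec = np.linalg.eigh(B)         # eigenvalues in ascending order
    idx = np.argsort(eigval)[::-1][:dim]       # keep the `dim` largest
    scale = np.sqrt(np.clip(eigval[idx], 0.0, None))
    return eigvec[:, idx] * scale
```

If the dissimilarities are exact Euclidean distances of points in the target dimension, this embedding reproduces them up to rotation and translation; for a general dissimilarity matrix such as $1 - |R|$ it gives only an approximation.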



3 Databases 

Due to the lack of a common database for evaluation in CBIR with known 
relevances we use two databases where relevances are implicitly given by classi- 
fications. These databases are chosen as representatives for two different types 
of CBIR tasks: The WANG database represents a CBIR task with arbitrary
photographs. In contrast, the IRMA database represents a CBIR task in which 
the images involved depict more clearly defined objects, i.e. the domain is con- 
siderably narrower. 

WANG. The WANG database is a subset of 1000 images of the Corel database 
which were selected manually to form 10 classes of 100 images each. The images 
are subdivided into 10 sufficiently distinct classes (e.g. ‘Africa’, ‘beach’, ‘monu- 
ments’, ‘food’) such that it can be assumed that a user wants to find the other 
images from the class if the query is from one of these ten classes. This database 
was created at the Pennsylvania State University and is publicly available 3 . The 
images are of size 384 x 256 and some examples are depicted in Figure 1. 

IRMA. The IRMA database is a database of 1617 medical radiographs collected 
in a collaboration project of the RWTH Aachen University [7]. The complete data 
are labelled using a multi-axial code describing several properties of the images. 
For the experiments presented here, the data were divided into the six classes 
‘abdomen’, ‘skull’, ‘limbs’, ‘chest’, ‘breast’ and ‘spine’, describing different body 
regions. The images are of varying sizes. Some examples are depicted in Figure 2. 

2 http://www.biol.ttu.edu/Strauss/Matlab/matlab.htm

3 http://wang.ist.psu.edu/ 
Fig. 2. Example images from the IRMA database.



4 Results 

To obtain systematic results for different features used in CBIR, we first analyze 
the characteristics of the features using their correlation. Then the performance 
of the retrieval system is determined for the IRMA and the WANG task. That 
is, we give leaving-one-out error rates for these two databases. Obviously one can 
expect to obtain better results using a combination of more than one feature but 
here we limit the investigation to the impact of single features on the retrieval 
result. Details about combinations of features are presented in [2,3]. 
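A leaving-one-out error rate can be sketched as below. The sketch assumes that a query counts as an error when its most similar database image (excluding the query itself) belongs to a different class; the function name and the use of a precomputed distance matrix are illustrative.

```python
import numpy as np

def leave_one_out_error_rate(dist, labels):
    """Leaving-one-out nearest-neighbor error rate in percent.

    dist:   (N, N) pairwise distances between all images of the database
    labels: (N,)   class label of each image
    """
    dist = np.array(dist, dtype=float)   # copy, so the caller's matrix is not modified
    labels = np.asarray(labels)
    np.fill_diagonal(dist, np.inf)       # an image must not retrieve itself
    nearest = dist.argmin(axis=1)
    return 100.0 * np.mean(labels[nearest] != labels)
```

Because every image serves once as the query against all remaining images, no separate test set is needed.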

4.1 Correlation of Features 

For improved performance, the advantages of different features can be combined. 
However, it is not clear how to choose the appropriate combination. To analyze 
which features have similar properties, we perform a correlation analysis as described
in Section 2.1 for the WANG and IRMA databases in a leaving-one-out
manner. The results from multi-dimensional scaling are shown in Figure 3. The
points in these graphs denote the different features. Several points of the same 
type represent different settings for the feature. The distances between the points 
indicate the correlations of the features. That is, points that are close together 
stand for features that are highly correlated and points farther away denote 
features with different characteristics. 

The graphs show that there are clear clusters of features. Both graphs have 
a large cluster of invariant feature histograms with monomial kernel functions. 
Also, the graphs show clusters of local features, local feature histograms, and 
Gabor feature histograms. The texture features do not form a cluster. This 
suggests that they describe different textural properties of the images and that 
it may be useful to combine them. In contrast, the cluster of invariant features 
shows that it is not suitable to use different invariant features at the same time. 
From Figure 3 it can be observed that region features, image features, invariant 
feature histograms, and Gabor histograms appear to have low correlation for 
the WANG data and therefore a combination of these features may be useful for 
photograph-like images. For the radiograph data the interpretation of Figure 3
suggests using texture features, image features, invariant feature histograms,
and Gabor histograms. The combination of these features is addressed in [2,3].

4.2 Different Features for Different Tasks 

As motivated above we use the error rate (ER) to compare the performance of 
different features. In [3] it has been observed that the commonly used measures 
[Figure 3 legend: image features, inv. feat. histogram, texture features, Gabor histogram, LF histogram, color histogram, region features, local features, Fourier Mellin feature]

Fig. 3. Two-dimensional representation from multi-dimensional scaling for features on
the WANG and IRMA database.



precision and recall are strongly correlated to the error rate. Furthermore, the 
error rate is one number that is easy to interpret and widely used in the context 
of image classification. 

Table 2 shows error rates for different features for the WANG and IRMA 
databases. From this table it can be observed that these different tasks re- 
quire different features. For the WANG database, consisting of very general 
photographs, invariant feature histograms and color histograms perform very 
well but for the IRMA database, consisting of images with mainly one clearly 
defined object per image, these features perform badly. In contrast to this, the 
pixel values as features perform very well for the IRMA task and badly for the 
WANG task. Also, the strong correlation of invariant feature histograms with 



Table 2. Error rates [%] of different features for the WANG and IRMA databases.

WANG                                  IRMA
Feature                    ER [%]     Feature                          ER [%]
-------------------------  ------     -------------------------------  ------
inv. feat. histogram         15.9     pixel values (IDM)                  6.7
color histogram              17.9     local feature histogram             9.3
pixel values (IDM)           22.3     local features                     13.0
Tamura histogram             31.0     pixel values (Euclidean)           17.7
local feature histogram      32.5     Tamura histogram                   19.3
Gabor histogram              48.2     Gabor histogram                    24.4
regions                      54.3     inv. feat. histogram               29.2
pixel values (Euclidean)     55.1     Fourier Mellin feature             53.1
local features               62.5     extended tangent distance [6]       8.0
                                      pseudo 2D HMM [6]                   5.3

color histograms is visible: for the WANG database the invariant feature histograms
yield only a small improvement. In both cases, the Tamura histograms
obtain good results, taking into account that they represent the textures of the
images only. It is interesting to observe that for the IRMA database the top
four methods are different representations and comparison methods based on
the pixel values, i.e. appearance-based representations.

5 Conclusion 

In this work, we quantitatively compared different features for CBIR tasks. The 
results show clearly that the performance of features is task dependent. For 
databases of arbitrary color photographs features like color histograms and in- 
variant feature histograms are essential to obtain good results. For databases 
from a narrower domain, i.e. with clearly defined objects as content, the pixel 
values of the images in combination with a suitable distance measure are most 
important for good retrieval performance. Furthermore, a method to visual- 
ize the correlation between features was introduced, which allows us to choose 
features of different characteristics for feature combination. In the future, the 
observations regarding the suitability of features for different tasks have to be 
experimentally validated on further databases. 
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Abstract. We consider the general problem of learning from labeled 
and unlabeled data. Given a set of points, some of them are labeled, and 
the remaining points are unlabeled. The goal is to predict the labels of 
the unlabeled points. Any supervised learning algorithm can be applied 
to this problem, for instance, Support Vector Machines (SVMs). The 
problem of our interest is if we can implement a classifier which uses the 
unlabeled data information in some way and has higher accuracy than 
the classifiers which use the labeled data only. Recently we proposed a 
simple algorithm, which can substantially benefit from large amounts of 
unlabeled data and demonstrates clear superiority to supervised learning 
methods. Here we further investigate the algorithm using random walks 
and spectral graph theory, which shed light on the key steps in this 
algorithm. 



1 Introduction 

We consider the general problem of learning from labeled and unlabeled data. 
Given a set of points, some of them are labeled, and the remaining points are 
unlabeled. The task is to predict the labels of the unlabeled points. This is a 
setting which is applicable to many real-world problems. We generally need to 
also predict the labels of the testing points which are unseen before. However, in 
practice, we almost always can add the new points into the set of the unlabeled 
data. 

Any learning algorithm can be applied to this problem, especially supervised 
learning methods, which train the classifiers with the labeled data and then use 
the trained classifiers to predict the labels of the unlabeled data. At present, one 
of the most popular supervised learning methods is the Support Vector Machine 
(SVM) [9]. The problem of interest here is if we can implement a classifier which 
uses the unlabeled data in some way and has higher accuracy than the classifiers 
which use the labeled data only [10]. 

Such a learning problem is often called semi-supervised. Since labeling often 
requires expensive human labor, whereas unlabeled data is far easier to obtain, 
semi-supervised learning is very useful in many real-world problems and has 
recently attracted a considerable amount of research. A typical application is 
web categorization, in which manually classified web pages are always a very 
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small part of the entire web, but the number of unlabeled examples can be 
almost as large as you want. 

Recently we proposed a simple algorithm, which can substantially benefit 
from large amounts of unlabeled data and works much better than the supervised 
learning methods [11]. Here we further investigate the algorithm using random 
walks and spectral graph theory, which shed light on some key steps in this 
algorithm, especially normalization. 

The paper is organized as follows. In Section 2 we describe our semi-supervised
learning algorithm in detail. In Section 3 the method is interpreted
in the framework of lazy random walks. In Section 4 we define calculus on dis- 
crete objects and then build the regularization framework of the method upon 
the discrete calculus. In Section 5 we use a toy problem to highlight the key 
steps in the method, and also validate the method on a large-scale real-world 
dataset. 

2 Algorithm 

Given a point set $X = \{x_1, \dots, x_l, x_{l+1}, \dots, x_n\} \subset \mathbb{R}^m$ and a label set $C =
\{-1, 1\}$, the first $l$ points $x_i$ ($i \le l$) are labeled as $y_i \in C$ and the remaining
points $x_u$ ($l + 1 \le u \le n$) are unlabeled. Define an $n \times 1$ vector $y$ with $y_i = 1$ or
$-1$ if $x_i$ is labeled as positive or negative, and 0 if $x_i$ is unlabeled. We can view
$y$ as a real-valued function defined on $X$, which assigns a value $y_i$ to point $x_i$.
The data is classified as follows:

1. Define an $n \times n$ affinity matrix $W$ in which the elements are nonnegative,
symmetric, and furthermore the diagonal elements are zeros.

2. Construct the matrix $S = D^{-1/2} W D^{-1/2}$ in which $D$ is a diagonal matrix
with its $(i, i)$-element equal to the sum of the $i$-th row of $W$.

3. Compute $f = (I - \alpha S)^{-1} y$, where $I$ denotes the identity matrix and $\alpha$ is a
parameter in $(0, 1)$, and assign a label $\mathrm{sgn}(f_i)$ to point $x_i$.

The affinity matrix can typically be defined by a Gaussian $W_{ij} = \exp(-\|x_i -
x_j\|^2 / 2\sigma^2)$ except that $W_{ii} = 0$, where $\|\cdot\|$ represents the Euclidean norm. We would
like to emphasize that the affinity matrix need not be derived from a kernel [8].
For instance, construct a $k$-NN or $\epsilon$-ball graph on the data, and then define $W_{ij} = 1$
if points $x_i$ and $x_j$ are connected by an edge, and 0 otherwise. Note that in this
case the requirement $W_{ii} = 0$ is satisfied automatically since there is no self-loop
edge.
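The three steps can be sketched directly with dense linear algebra. This is a minimal sketch that assumes a Gaussian affinity matrix and a data set small enough for a dense solve; the function name and the default parameter values are illustrative.

```python
import numpy as np

def classify(X, y, sigma=1.0, alpha=0.95):
    """Label propagation following the three steps of the algorithm.

    X: (n, m) data points; y: (n,) entries +1 / -1 for labeled points, 0 for unlabeled.
    Returns the predicted labels sgn(f) and the real-valued function f.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2.0 * sigma ** 2))     # step 1: Gaussian affinities
    np.fill_diagonal(W, 0.0)                 #         with zero diagonal
    d = W.sum(axis=1)                        # row sums, the diagonal of D
    S = W / np.sqrt(np.outer(d, d))          # step 2: S = D^{-1/2} W D^{-1/2}
    f = np.linalg.solve(np.eye(len(y)) - alpha * S, y)   # step 3: f = (I - alpha S)^{-1} y
    return np.sign(f), f
```

Solving the linear system avoids forming the inverse explicitly; for large data sets a sparse affinity matrix with an iterative solver would be the natural replacement.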

3 Lazy Random Walks 

In this section we interpret the algorithm in terms of random walks inspired 
by [6]. We will see that this method simply classifies the points by comparing 
a specific distance measure between them and the labeled points of different 
classes. 

Let $\Gamma = (V, E)$ denote a graph with a set $V$ of $n$ vertices indexed by numbers
from 1 to $n$ and an edge collection $E$. Assume the graph is undirected and
connected, and has no self-loops or multiple edges. A weight function $w : V \times V \to
\mathbb{R}$ associated with the graph satisfies $w(i,j) = w(j,i)$ and $w(i,j) \ge 0$. Moreover,
define $w(i,j) = 0$ if there is no edge between $i$ and $j$. The degree $d_i$ of vertex $i$
is defined to be

$$d_i = \sum_{j \sim i} w(i,j), \qquad (3.1)$$

where $j \sim i$ denotes the set of the points which are linked to point $i$.

Let $D$ denote the diagonal matrix with the $(i, i)$-th entry having value $d_i$. Let
$W$ denote the matrix with the entries $W_{ij} = w(i,j)$. A lazy random walk on the
graph is determined by the transition probability matrix $P = (1 - \alpha) I + \alpha D^{-1} W$.
Here $\alpha$ is a parameter in $(0, 1)$ as before. This means that, with probability
$\alpha$, the walk follows one link which connects the vertex of the current position and is
chosen with probability proportional to the weight of the link, and with
probability $1 - \alpha$, it just stays at the current position.

There exists a unique stationary distribution $\pi = [\pi_1, \dots, \pi_n]$ for the lazy
random walk, i.e. a unique probability distribution satisfying the balance equation

$$\pi = \pi P. \qquad (3.2)$$

Let $\mathbf{1}$ denote the $1 \times n$ vector with all entries equal to 1. Let $\mathrm{vol}\,\Gamma$ denote the
volume of the graph, which is defined by the sum of vertex degrees. It is not
hard to see that the stationary distribution of the random walk is

$$\pi = \mathbf{1} D / \mathrm{vol}\,\Gamma. \qquad (3.3)$$

Note that $\pi$ does not depend on $\alpha$.
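The stationary distribution can be checked numerically on a small example. The weight matrix and the value of the laziness parameter below are arbitrary illustrative choices.

```python
import numpy as np

# Small symmetric weight matrix with zero diagonal (illustrative values).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
alpha = 0.7

d = W.sum(axis=1)                                       # vertex degrees
P = (1 - alpha) * np.eye(3) + alpha * W / d[:, None]    # lazy transition matrix
pi = d / d.sum()                                        # degrees divided by the graph volume

# pi is a probability vector and satisfies the balance equation pi = pi P.
balanced = np.allclose(pi @ P, pi)
```

Changing `alpha` alters `P` but leaves `pi` untouched, in line with the remark that the stationary distribution does not depend on the laziness parameter.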

Let $X_t$ denote the position of the random walk at time $t$. Write $T_{ij} = \min\{t \ge
0 \mid X_t = x_j, X_0 = x_i, x_i \ne x_j\}$ for the first hitting time to $x_j$ with the initial
position $x_i$, and write $T_{ii} = \min\{t > 0 \mid X_t = x_i, X_0 = x_i\}$, which is called
the first return time to $x_i$ [1]. Let $H_{ij}$ denote the expected number of steps
required for a random walk to reach $x_j$ with an initial position $x_i$, i.e. $H_{ij}$ is
the expectation of $T_{ij}$. $H_{ij}$ is often called the hitting time. Let $C_{ij}$ denote the
expected number of steps for a random walk starting at $x_i$ to reach $x_j$ and then
return, i.e. $C_{ij} = H_{ij} + H_{ji}$, which is often called the commute time between $x_i$
and $x_j$. Clearly, $C_{ij}$ is symmetric, but $H_{ij}$ may not be.

Let $G$ denote the inverse of the matrix $D - \alpha W$. Then the commute time
satisfies [6]:

$$C_{ij} \propto G_{ii} + G_{jj} - G_{ij} - G_{ji}, \quad \text{if } x_i \ne x_j, \qquad (3.4)$$

and [1]

$$C_{ii} = 1/\pi_i. \qquad (3.5)$$

The relation between $G$ and $C$ is similar to the inner product and the norm in
Euclidean space. Let $\langle x_i, x_j \rangle$ denote the Euclidean inner product between $x_i$
and $x_j$. Then the Euclidean norm of the vector $x_i - x_j$ satisfies

$$\|x_i - x_j\|^2 = \langle x_i - x_j, x_i - x_j \rangle = \langle x_i, x_i \rangle + \langle x_j, x_j \rangle - \langle x_i, x_j \rangle - \langle x_j, x_i \rangle.$$
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In other words, we can think of G as a Gram matrix which specifies a kind 
of inner product on the dataset. The commute time is the corresponding norm 
derived from this inner product. 

Note that $H_{ij}$ is quite small whenever $x_j$ is a node with a large stationary
probability $\pi_j$. Thus we naturally consider normalizing $H_{ij}$ by

$$\bar{H}_{ij} = \sqrt{\pi_i \pi_j}\, H_{ij}. \qquad (3.6)$$

Accordingly the normalized commute time is

$$\bar{C}_{ij} = \bar{H}_{ji} + \bar{H}_{ij}. \qquad (3.7)$$

Let $\bar{G}$ denote the inverse of the matrix $I - \alpha S$. Then the normalized commute
time satisfies

$$\bar{C}_{ij} \propto \bar{G}_{ii} + \bar{G}_{jj} - \bar{G}_{ij} - \bar{G}_{ji}. \qquad (3.8)$$

Noting the equality (3.5), we have

$$\bar{G}_{ij} = \frac{G_{ij}}{\sqrt{C_{ii} C_{jj}}}, \qquad (3.9)$$

which is parallel to the normalized Euclidean product $\langle x_i, x_j \rangle / \|x_i\| \|x_j\|$ or cosine.
Let

$$p_+(x_i) = \sum_{\{j \mid y_j = 1\}} \bar{G}_{ij}, \quad \text{and} \quad p_-(x_i) = \sum_{\{j \mid y_j = -1\}} \bar{G}_{ij}. \qquad (3.10)$$

Then the classification given by $f = (I - \alpha S)^{-1} y$ is simply checking which of the
two values $p_+(x_i)$ or $p_-(x_i)$ is larger, which is in turn comparing the normalized
commute times to the labeled points of different classes.
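The equivalence between the sign of $f$ and the comparison of $p_+$ and $p_-$ can be verified on a toy graph. The graph, the labels, and the variable names below are illustrative assumptions; the check also confirms the identity $\bar{G} = D^{1/2} G D^{1/2}$, which follows from $I - \alpha S = D^{-1/2} (D - \alpha W) D^{-1/2}$.

```python
import numpy as np

# Path graph on four vertices with illustrative edge weights.
W = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 2.0, 0.0],
              [0.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
alpha = 0.9
y = np.array([1.0, 0.0, 0.0, -1.0])      # first vertex positive, last vertex negative

d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))
G = np.linalg.inv(np.diag(d) - alpha * W)          # inverse of D - alpha W
G_bar = np.linalg.inv(np.eye(len(d)) - alpha * S)  # inverse of I - alpha S

p_plus = G_bar[:, y == 1].sum(axis=1)    # sum over positively labeled points
p_minus = G_bar[:, y == -1].sum(axis=1)  # sum over negatively labeled points
f = G_bar @ y                            # f = (I - alpha S)^{-1} y
```

Since $y$ only has entries $+1$, $-1$, and $0$, the product `G_bar @ y` is exactly `p_plus - p_minus`, so the sign of `f` reports which of the two sums is larger.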

If we just want to compare the non-normalized commute times to the labeled
points of the different classes, then the classification is given by $f = (D - \alpha W)^{-1} y$.
Although the normalized commute time seems to be a more reasonable choice,
there is still a lack of statistical evidence showing the superiority of the
normalized commute time over the non-normalized one. However, we can construct a
subtle toy problem (see Section 5) to expose the necessity of normalization.



4 Regularization Framework 



In this section we define calculus on graphs inspired by spectral graph theory [4] 
and [3]. A regularization framework for classification problems on graphs then 
can be naturally built upon the discrete calculus, and the algorithm derived from 
the framework is exactly the method presented in Section 2. 

Let $\mathcal{F}$ denote the space of functions defined on the vertices of graph $\Gamma$, which
assigns a value $f_i$ to vertex $i$. We can view $f$ as an $n \times 1$ vector. The edge derivative
of $f$ along the edge $e(i, j)$ at the vertex $i$ is defined to be

$$\frac{\partial f}{\partial e}\Big|_i = \sqrt{\frac{w(i,j)}{d_i}}\, f_i - \sqrt{\frac{w(i,j)}{d_j}}\, f_j. \qquad (4.1)$$

Clearly

$$\frac{\partial f}{\partial e}\Big|_i = -\frac{\partial f}{\partial e}\Big|_j. \qquad (4.2)$$



The definition (4.1) in fact splits the function value at each point among the 
edges incident with it before computing the local changes of the function, and 
the value assigned to each edge is proportional to its weight. This statement can 
be clearer if we rewrite (4.1) as 



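$$\frac{\partial f}{\partial e}\Big|_i = \sqrt{w(i,j)} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right).$$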



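The local variation of function $f$ at each vertex $i$ is then defined by:

$$\|\nabla_i f\| = \sqrt{\sum_{e \vdash i} \left( \frac{\partial f}{\partial e}\Big|_i \right)^2}, \qquad (4.3)$$

where $e \vdash i$ means the set of the edges incident with vertex $i$. The smoothness
of function $f$ is then naturally measured by the sum of the local variations at
each point:

$$S(f) = \frac{1}{2} \sum_i \|\nabla_i f\|^2. \qquad (4.4)$$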

The graph Laplacian is defined to be [4]

$$\Delta = D^{-1/2} (D - W) D^{-1/2} = I - D^{-1/2} W D^{-1/2} = I - S, \qquad (4.5)$$

where $S$ is defined to be $S = D^{-1/2} W D^{-1/2}$. The Laplacian can be thought of
as an operator defined on the function space:

$$(\Delta f)_i = f_i - \sum_{j \sim i} \frac{w(i,j)}{\sqrt{d_i d_j}}\, f_j. \qquad (4.6)$$



The smallest eigenvalue of the Laplacian is zero because the largest eigenvalue
of $S$ is 1. Hence the Laplacian is symmetric and positive semi-definite. Let $\mathbf{1}$
denote the constant function which assumes the value 1 on each vertex. We can
view $\mathbf{1}$ as a column vector. Then $D^{1/2} \mathbf{1}$ is the eigenvector corresponding to the
smallest eigenvalue of $\Delta$. Most importantly, we have the following equality

$$f^T \Delta f = S(f), \qquad (4.7)$$

which exposes the essential relation between the Laplacian and the gradient.
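The equality (4.7) can be checked numerically. The sketch below assumes that the normalized edge derivative has the form $\sqrt{w(i,j)}\,(f_i/\sqrt{d_i} - f_j/\sqrt{d_j})$, which is the form consistent with (4.2), (4.5), and (4.7); the random graph and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 5))
W = (A + A.T) / 2.0                 # symmetric nonnegative weights
np.fill_diagonal(W, 0.0)            # no self-loops
d = W.sum(axis=1)                   # vertex degrees

S = W / np.sqrt(np.outer(d, d))     # S = D^{-1/2} W D^{-1/2}
Delta = np.eye(5) - S               # normalized graph Laplacian
f = rng.standard_normal(5)

# Entry (i, j) holds the (assumed) edge derivative of f along e(i, j) at vertex i.
g = f / np.sqrt(d)
edge_deriv = np.sqrt(W) * (g[:, None] - g[None, :])

smoothness = 0.5 * np.sum(edge_deriv ** 2)   # half the sum of squared local variations
quadratic_form = f @ Delta @ f               # f^T Delta f
```

The two quantities agree up to rounding, and `Delta @ np.sqrt(d)` vanishes, which is the eigenvector statement for the zero eigenvalue.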

For the classification problem on graphs, it is natural to define the cost
function associated to a classification function $f$ to be

$$\arg\min_{f \in \mathcal{F}} \left\{ S(f) + \frac{\mu}{2} \|f - y\|^2 \right\}. \qquad (4.8)$$
The first term in the bracket is called the smoothness term or regularizer, which
requires the function to be as smooth as possible. The second term is called
the fitting term, which requires the function to be as close to the initial label
assignment as possible. The trade-off between these two competing terms is
captured by a positive parameter $\mu$. It is not hard to show that the solution of
(4.8) is

$$f = (1 - \alpha)(I - \alpha S)^{-1} y, \qquad (4.9)$$

where $\alpha = 1/(1 + \mu)$. Clearly, for classification it is equivalent to $f = (I - \alpha S)^{-1} y$,
since the positive factor $1 - \alpha$ does not change the sign of $f$.

Finally, we discuss the non-normalized variant of the definition of edge
derivative:

$$\frac{\partial f}{\partial e}\Big|_i = \sqrt{w(i,j)}\, (f_i - f_j).$$

If we further define the graph Laplacian to be $L = D - W$, then the equality
(4.7) still holds. Substituting the local variation based on the non-normalized
edge derivative into the optimization problem (4.8), we then can obtain a
different closed form solution $f = \mu (\mu I + L)^{-1} y$, which is quite close to the
algorithm proposed by [2]. In Section 5, we will provide experimental evidence
to demonstrate the superiority of the algorithm based on the normalized edge
derivative (4.1).



5 Experiments 

5.1 Toy Problem 

Shown in Figure 1(a) is the doll toy data, in which the density of the data
varies substantially across different clusters. A similar toy dataset was used by
[7] for clustering problems. The affinity matrix is defined by a Gaussian. The
result given by the algorithm $f = (D - \alpha W)^{-1} y$ derived from the non-normalized
commute time is shown in Figure 1(b). The result given by the algorithm $f =
(\mu I + L)^{-1} y$ derived from the non-normalized edge derivative is shown in Figure
1(c). Obviously, both methods fail to capture the coherent clusters aggregated
by the data. The result given by the algorithm $f = (I - \alpha S)^{-1} y$, presented in
Section 2, which can be derived from both the normalized commute time and edge
derivative, is shown in Figure 1(d). This method sensibly classifies the dataset
in accordance with the global data distribution.

In addition, we use the toy problem to demonstrate the importance of the zero
diagonal in the first step of the standard algorithm. If we define the affinity
matrix using an RBF kernel without removing the diagonal elements, the result
is shown in Figure 1(e). The intuition behind setting the diagonal elements to
zero is to avoid self-reinforcement.

Finally, we investigate the fitting term of the regularization framework using 
the toy problem. Note that we assign a prior label 0 to the unlabeled points in the 
fitting term. This is different from the regularization frameworks of supervised 
learning methods, in which the fitting term is only for the labeled points. If 
[Figure 1 panel titles recovered: (a) Toy data (doll), (b) Non-normalized commute time, (c) Non-normalized edge derivative]

Fig. 1. Classification on the doll toy data. Both methods in (b) and (c) without
normalization fail to classify the points in accordance with the coherent clusters. The importance
of zero diagonal and the fitting on the unlabeled data are demonstrated respectively
in (d) and (e). The result from the standard algorithm is shown in (f).

we remove the fitting on the unlabeled points, the result is given in Figure 1(f). 
The intuition behind the fitting on the unlabeled points is to make the algorithm 
more stable. 



5.2 Digit Recognition 

We addressed a classification task using the USPS dataset containing 9298
handwritten digits. Each digit is a 16x16 image, represented as a 256-dimensional
vector with entries in the range from -1 to 1.

We used $k$-NN [5] and one-vs-rest SVMs [8] as baselines. Since there is no
reliable approach for model selection if only very few labeled points are available,
we chose the respective optimal parameters of these methods. The $k$ in $k$-NN
was set to 1. The width of the RBF kernel for the SVM was set to 5. The affinity
matrix used in our method was derived from an RBF kernel with its width equal
to 1.25. In addition, the parameter $\alpha$ was set to 0.95.

The test errors for different methods with the number of labeled points
increasing from 10 to 100 are summarized in the left panel of Figure 2, in which
each error point is averaged over 100 random trials, and samples are chosen so
that they contain at least one labeled point for each class. The results show
clear superiority of our algorithm (marked as random walk) over the supervised
learning methods $k$-NN and SVMs. The right panel of Figure 2 shows how the
parameter $\alpha$ influences the performance of the method, in which the number of
labeled points is fixed at 50. Obviously, this method is not sensitive to the value
of $\alpha$.
Fig. 2. Digit recognition with USPS handwritten 16x16 digits dataset for a total of
9298. The left panel shows test errors for different algorithms with the number of labeled
points increasing from 10 to 100. The right panel shows how the different choices of
the parameter $\alpha$ influence the performance of our method (with 50 labeled points).

Acknowledgments. We would like to thank Arthur Gretton for helpful dis- 
cussions on normalized commute time in random walks. 
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Abstract. We compare two approaches to the problem of estimating 
the depth of a point in space from observing its image position in two 
different cameras: 1. The classical photogrammetric approach explicitly 
models the two cameras and estimates their intrinsic and extrinsic pa- 
rameters using a tedious calibration procedure; 2. A generic machine 
learning approach where the mapping from image to spatial coordinates 
is directly approximated by a Gaussian Process regression. Our results 
show that the generic learning approach, in addition to simplifying the 
procedure of calibration, can lead to higher depth accuracies than clas- 
sical calibration although no specific domain knowledge is used. 



1 Introduction 



Inferring the three-dimensional structure of a scene from a pair of stereo images 
is one of the principal problems in computer vision. The position
$\mathbf{X} = (X, Y, Z)$ of a point in space is related to its image at
$\mathbf{x} = (x, y)$ by the equations of perspective projection

$$x = x_0 - s_{xy}\, c \cdot \frac{r_{11}(X - X_0) + r_{21}(Y - Y_0) + r_{31}(Z - Z_0)}{r_{13}(X - X_0) + r_{23}(Y - Y_0) + r_{33}(Z - Z_0)} + \Xi_x(\mathbf{x}) \qquad (1)$$

$$y = y_0 - c \cdot \frac{r_{12}(X - X_0) + r_{22}(Y - Y_0) + r_{32}(Z - Z_0)}{r_{13}(X - X_0) + r_{23}(Y - Y_0) + r_{33}(Z - Z_0)} + \Xi_y(\mathbf{x}) \qquad (2)$$

where $\mathbf{x}_0 = (x_0, y_0)$ denotes the image coordinates of the principal
point of the camera, $c$ the focal length, $\mathbf{X}_0 = (X_0, Y_0, Z_0)$ the
3D-position of the camera's optical center with respect to the reference frame,
and $r_{ij}$ the coefficients of a $3 \times 3$ rotation matrix $R$ describing
the orientation of the camera. The factor $s_{xy}$ accounts for the difference
in pixel width and height of the images, the 2-D-vector field
$\Xi(\mathbf{x})$ for the lens distortions.
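The projection in Eqs. (1) and (2) can be sketched in a few lines. This is only an illustration under stated assumptions: the lens distortion field is set to zero, the function name `project` and its argument layout are hypothetical, and `R[i-1, j-1]` is taken to hold the coefficient $r_{ij}$.

```python
import numpy as np

def project(X, x0, c, s_xy, R, X0):
    """Distortion-free perspective projection following Eqs. (1) and (2).

    X, X0 : 3-vectors (point and optical center), x0 : principal point (x0, y0),
    c : focal length, s_xy : pixel aspect factor, R : 3x3 rotation matrix.
    """
    # (R^T d)[0] = r11*dX + r21*dY + r31*dZ, matching the numerator of Eq. (1)
    d = R.T @ (np.asarray(X, dtype=float) - np.asarray(X0, dtype=float))
    x = x0[0] - s_xy * c * d[0] / d[2]
    y = x0[1] - c * d[1] / d[2]
    return np.array([x, y])
```

With the camera at the origin and `R` the identity, a point on the optical axis lands on the principal point, which is a quick sanity check for the sign conventions.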

The classical approach to stereo vision requires a calibration procedure before
the projection equations can be inverted to obtain spatial position, i.e.,
estimating the extrinsic ($\mathbf{X}_0$ and $R$) and intrinsic ($\mathbf{x}_0$,
$c$, $s_{xy}$ and $\Xi$) parameters of each camera from a set of points with
known spatial position and their corresponding image positions. This is normally
done by repeatedly linearizing the projection equations and applying a standard
least-squares estimator to obtain an iteratively refined estimate of the camera
parameters [1]. This approach neglects the nonlinear nature of the problem, so
its convergence depends critically on the choice of the initial values for the
parameters. Moreover, choosing the right initial values and setting up the
models properly can be a tedious procedure.

The presence of observations and desired target values, on the other hand,
makes depth estimation suitable for the application of nonlinear supervised
learning algorithms such as Gaussian Process Regression. This algorithm does not
require any specific domain knowledge and provides a direct solution to nonlinear 
estimation problems. Here, we investigate whether such a machine learning ap- 
proach can reach a comparable performance to classical camera calibration. This 
can lead to a considerable simplification in practical depth estimation problems 
as off-the-shelf algorithms can be used without specific adaptations to the setup 
of the stereo problem at hand. 

2 Classical Camera Calibration 

As described above, the image coordinates of a point are related to the camera's
parameters and its spatial position by a nonlinear function $F$ (see Eqs. 1 and 2)

$$\mathbf{x} = F(\mathbf{x}_0, c, s_{xy}, R, \mathbf{X}_0, \Xi, \mathbf{X}) \qquad (3)$$

The estimation of parameters is done by a procedure called bundle adjustment,
which consists of iteratively linearizing the camera model in parameter space and
estimating an improvement for the parameters from the error on a set of $m$ known
pairs of image coordinates $\mathbf{x}_i = (x_i, y_i)$ and spatial coordinates
$\mathbf{X}_i = (X_i, Y_i, Z_i)$. These can be obtained from an object with a
number of distinct points whose coordinates with respect to some reference frame
are known with high precision such as, for instance, a calibration rig.

Before this can be done, we need to choose a low-dimensional parameterization
of the lens distortion field $\Xi$ because otherwise the equation system (3)
for the points $1, \ldots, m$ would be underdetermined. Here, we model the x- and
y-component of $\Xi$ as a weighted sum over products of one-dimensional Chebychev
polynomials $T_i$ in $x$ and $y$, where $i$ indicates the degree of the polynomial

$$\Xi_x(\mathbf{x}) = \sum_{i,j=0}^{t} a_{ij}\, T_i(s_x x)\, T_j(s_y y), \qquad \Xi_y(\mathbf{x}) = \sum_{i,j=0}^{t} b_{ij}\, T_i(s_x x)\, T_j(s_y y). \qquad (4)$$

The factors $s_x, s_y$ scale the image coordinates to the Chebychev polynomials'
domain $[-1, 1]$. In the following, we denote the vector of the complete set of
camera parameters by
$\theta = (\mathbf{x}_0, c, s_{xy}, R, \mathbf{X}_0, a_{11}, \ldots, a_{tt}, b_{11}, \ldots, b_{tt})$.

In the iterative bundle adjustment procedure, we assume we have a parameter
estimate $\theta_{n-1}$ from the previous iteration. The residual $\mathbf{l}_i$
of point $i$ for the camera model from the previous iteration is then given by

$$\mathbf{l}_i = \mathbf{x}_i - F(\theta_{n-1}, \mathbf{X}_i). \qquad (5)$$

This equation system is linearized by computing the Jacobian $J(\theta_{n-1})$
of $F$ at $\theta_{n-1}$ such that we obtain

$$\mathbf{l} \approx J(\theta_{n-1})\, \Delta\theta \qquad (6)$$

where $\mathbf{l}$ is the concatenation of all $\mathbf{l}_i$ and $\Delta\theta$
is the estimation error in $\theta$ that causes the residuals. Usually, one
assumes a prior covariance $\Sigma_{ll}$ on $\mathbf{l}$ describing the
inaccuracies in the image position measurements. $\Delta\theta$ is then obtained
from a standard linear estimator [3]

$$\Delta\theta = (J^T \Sigma_{ll}^{-1} J)^{-1} J^T \Sigma_{ll}^{-1}\, \mathbf{l}. \qquad (7)$$

Finally, the new parameter estimate $\theta_n$ for iteration $n$ is improved
according to $\theta_n = \theta_{n-1} + \Delta\theta$. Bundle adjustment needs a
good initial estimate $\theta_0$ for the camera parameters in order to ensure
that the iterations converge to the correct solution. There exists a great
variety of procedures for obtaining initial estimates which have to be
specifically chosen for the application (e.g. aerial or near-range
photogrammetry).
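The iteration of Eqs. (5) to (7) is a weighted Gauss-Newton scheme. The following is a minimal sketch rather than the procedure used in the paper: the model `F(theta, X)` is any user-supplied function (not the camera model of Eqs. 1 and 2), the Jacobian is approximated by forward differences, and the names `numerical_jacobian` and `gauss_newton` are hypothetical.

```python
import numpy as np

def numerical_jacobian(F, theta, X, eps=1e-6):
    """Forward-difference Jacobian of the stacked predictions F(theta, X) w.r.t. theta."""
    base = F(theta, X).ravel()
    J = np.empty((base.size, theta.size))
    for p in range(theta.size):
        step = np.zeros_like(theta)
        step[p] = eps
        J[:, p] = (F(theta + step, X).ravel() - base) / eps
    return J

def gauss_newton(F, theta0, X, x_obs, sigma_ll_inv, n_iter=20):
    """Iterate theta_n = theta_{n-1} + dtheta, with dtheta taken from Eq. (7)."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_iter):
        l = (x_obs - F(theta, X)).ravel()            # residuals, Eq. (5)
        J = numerical_jacobian(F, theta, X)          # linearization, Eq. (6)
        normal_matrix = J.T @ sigma_ll_inv @ J
        dtheta = np.linalg.solve(normal_matrix, J.T @ sigma_ll_inv @ l)  # Eq. (7)
        theta = theta + dtheta
    return theta
```

A fixed iteration count keeps the sketch short; a practical implementation would stop once the update becomes small, and, as the text notes, the result depends on a reasonable starting value `theta0`.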

The quality of the estimation can still be improved by modelling uncertainties
in the spatial observations $\mathbf{X}_i$. This can be done by including all
spatial observations in the parameter set and updating them in the same manner,
which requires the additional choice of the covariance $\Sigma_{XX}$ of the
measurements of spatial position [1]. $\Sigma_{XX}$ regulates the tradeoff
between the trust in the accuracy of the image observations on the one hand and
the spatial observations on the other hand. For more detailed information on
bundle adjustment please refer to [1].

Once the parameter sets $\theta^{(1)}$ and $\theta^{(2)}$ of the two camera
models are known, the spatial position $\mathbf{X}_*$ of a newly observed image
point ($\mathbf{x}_*^{(1)}$ in the first and $\mathbf{x}_*^{(2)}$ in the second
camera) can be estimated using the same technique. Again, $F$ describes the
stereo camera's mapping from spatial to image coordinates according to
Eqs. 1 and 2

$$\mathbf{x}_*^{(k)} = F(\theta^{(k)}, \mathbf{X}_*), \qquad k = 1, 2 \qquad (8)$$

but this time the $\theta^{(k)}$ are kept fixed and the bundle adjustment is
computed for estimates of $\mathbf{X}_*$ [1].

3 Gaussian Process Regression 

The machine learning algorithm used in our study assumes that the data are
generated by a Gaussian Process (GP). Let us call $f(\mathbf{x})$ the non-linear
function that maps the $D$-dimensional input $\mathbf{x}$ to a 1-dimensional
output. Given an arbitrary set of inputs $\{\mathbf{x}_i \mid i = 1, \ldots, m\}$,
the prior distribution of the corresponding function evaluations
$\mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_m)]^T$ is jointly Gaussian:

$$p(\mathbf{f} \mid \mathbf{x}_1, \ldots, \mathbf{x}_m, \theta) \sim \mathcal{N}(\mathbf{0}, K), \qquad (9)$$

with zero mean (a common and arbitrary choice) and covariance matrix $K$.
The elements of $K$ are computed from a parameterized covariance function,
$K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j, \theta)$, where $\theta$ now represents
the GP parameters. In Sect. 4 we present the two covariance functions we used in
our experiments.

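We assume that the output observations $y_i$ differ from the corresponding
function evaluations $f(\mathbf{x}_i)$ by Gaussian additive i.i.d. noise of mean
zero and variance $\sigma^2$. For simplicity in the notation, we absorb
$\sigma^2$ in the set of parameters $\theta$. Consider now that we have observed
the targets $\mathbf{y} = [y_1, \ldots, y_m]^T$ associated to our arbitrary set
of $m$ inputs, and would like to infer the predictive distribution of the
unknown target $y_*$ associated to a new input $\mathbf{x}_*$. First we write
the joint distribution of all targets considered, easily obtained from the
definition of the prior and of the noise model:

$$p(\mathbf{y}, y_* \mid \mathbf{x}_*, \mathbf{x}_1, \ldots, \mathbf{x}_m, \theta) \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K + \sigma^2 I & \mathbf{k}_* \\ \mathbf{k}_*^T & k(\mathbf{x}_*, \mathbf{x}_*) + \sigma^2 \end{bmatrix}\right) \qquad (10)$$

where $\mathbf{k}_* = [k(\mathbf{x}_*, \mathbf{x}_1), \ldots, k(\mathbf{x}_*, \mathbf{x}_m)]^T$
is the covariance between $y_*$ and $\mathbf{y}$, and $I$ is the identity
matrix. The predictive distribution is then obtained by conditioning on the
observed outputs $\mathbf{y}$. It is Gaussian:

$$p(y_* \mid \mathbf{y}, \mathbf{x}_*, \mathbf{x}_1, \ldots, \mathbf{x}_m, \theta) \sim \mathcal{N}(m(\mathbf{x}_*), v(\mathbf{x}_*)), \qquad (11)$$

with mean and variance given respectively by:

$$m(\mathbf{x}_*) = \mathbf{k}_*^T [K + \sigma^2 I]^{-1} \mathbf{y}, \qquad v(\mathbf{x}_*) = \sigma^2 + k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T [K + \sigma^2 I]^{-1} \mathbf{k}_*. \qquad (12)$$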



Given our assumptions about the noise, the mean of the predictive distribution
of $f(\mathbf{x}_*)$ is also equal to $m(\mathbf{x}_*)$, and it is the optimal
point estimate of $f(\mathbf{x}_*)$. It is interesting to notice that the
prediction equation given by $m(\mathbf{x}_*)$ is identical to the one used in
Kernel Ridge Regression (KRR) [2]. However, GPs differ from KRR in that they
provide full predictive distributions.
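The predictive mean and variance above can be evaluated without forming an explicit matrix inverse. The sketch below is one possible way to do this, using a Cholesky factorization of $K + \sigma^2 I$; the function name `gp_predict` and its argument names are hypothetical, and the kernel values are assumed to be precomputed.

```python
import numpy as np

def gp_predict(K, k_star, k_ss, y, sigma2):
    """Predictive mean m(x*) and variance v(x*) for a single test input.

    K      : (m, m) kernel matrix of the training inputs
    k_star : (m,) vector of kernel values k(x*, x_i)
    k_ss   : scalar k(x*, x*)
    y      : (m,) training targets
    sigma2 : noise variance
    """
    C = K + sigma2 * np.eye(K.shape[0])
    L = np.linalg.cholesky(C)                              # C = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # C^{-1} y
    w = np.linalg.solve(L, k_star)                         # w^T w = k*^T C^{-1} k*
    mean = k_star @ alpha
    var = sigma2 + k_ss - w @ w
    return mean, var
```

The mean alone is what Kernel Ridge Regression would return; the variance is the additional quantity a GP provides.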

One way of learning the parameters $\theta$ of the GP is by maximizing the
evidence of the observed targets $\mathbf{y}$ (or marginal likelihood of the
parameters $\theta$). In practice, we equivalently minimize the negative log
evidence, given by:

$$-\log p(\mathbf{y} \mid \mathbf{x}_1, \ldots, \mathbf{x}_m, \theta) = \frac{1}{2} \log |K + \sigma^2 I| + \frac{1}{2}\, \mathbf{y}^T [K + \sigma^2 I]^{-1} \mathbf{y}. \qquad (13)$$

Minimization is achieved by taking derivatives and using conjugate gradients.
An alternative way of inferring $\theta$ is to use a Bayesian variant of the
leave-one-out error (GPP, Geisser's surrogate predictive probability, [4]). In
our study we will use both methods, choosing the most appropriate one for each
of our two covariance functions. More details are provided in Sect. 4.
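As an illustration of the criterion being minimized, the two terms of the negative log evidence can be computed from a Cholesky factor. This is a sketch under assumptions: the name `neg_log_evidence` is hypothetical, the kernel matrix is assumed to be given, and any additive constant that does not depend on the parameters is left out.

```python
import numpy as np

def neg_log_evidence(K, y, sigma2):
    """0.5 * log|K + sigma2 I| + 0.5 * y^T (K + sigma2 I)^{-1} y."""
    C = K + sigma2 * np.eye(K.shape[0])
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # C^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))            # log|C| from the factor
    return 0.5 * log_det + 0.5 * (y @ alpha)
```

In a full implementation this value, together with its derivatives with respect to the kernel parameters, would be handed to a gradient-based optimizer such as conjugate gradients.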



4 Experiments 

Dataset. We used a robot manipulator holding a calibration target with a
flattened LED to record the data items. The target was moved in planes of
different depths, perpendicular to the axis of the stereo setup. The spatial
position of the LED was determined from the position encoders of the robot arm
with a nominal positioning accuracy of 0.01 mm. The center of the LED was
detected using several image processing steps. First, a threshold operation
using the upper 0.01 percentile of the image's gray-scale values predetected the
LED. Then a two-dimensional spline was fitted through a window around the image
of the LED with an approximate size of 20 px. A Sobel operator was used as edge
detector on the spline and a Zhou operator located the LED center with high
accuracy (see [1]). We recorded 992 pairs of spatial and image positions, 200 of
which were randomly selected as training set. The remaining 792 were used as
test set.

Fig. 1. Robot arm and calibration target, which were used to record the data items.



Classical calibration. During bundle adjustment, several camera parameters
were highly correlated with others. Small variations of these parameters
produced nearly the same variation of the function values of $F$, which led to a
linear dependency of the columns of $J$ and thus to a rank deficiency of
$J^T \Sigma_{ll}^{-1} J$. Therefore, the parameters of a correlating pair could
not be determined properly. To avoid such high correlations we excluded
$\mathbf{x}_0$, $a_{00}$, $b_{00}$, $a_{10}$, $b_{01}$, $a_{12}$, $b_{21}$,
$a_{01}$, $a_{20}$ and $a_{02}$ from estimation and set $a_{01}$, $a_{20}$,
$a_{02}$ to the values of $b_{01}$, $b_{02}$, $b_{20}$ (see [7] for more detailed
information on the parameterization of the camera model and [6] for the exact
procedure in our setting).

We used a ten-fold cross-validation scheme to determine whether the
corresponding coefficients should be included in the model or not. The error in
the image coordinates was assumed to be conditionally independent with
$\sigma^2 = 0.25$ px, so the covariance matrix $\Sigma_{ll}$ became diagonal
with $\Sigma_{ll} = 0.25 \cdot I$. The same assumption was made for
$\Sigma_{XX}$, though the value of the diagonal elements was chosen by ten-fold
cross-validation.



Gaussian Process Regression. For the machine learning approach we used both
the inhomogeneous polynomial kernel

$$k(\mathbf{x}, \mathbf{x}') = \sigma_\nu^2 \left( \langle \mathbf{x}, \mathbf{x}' \rangle + 1 \right)^g \qquad (14)$$

of degree $g$ and the squared exponential kernel

$$k(\mathbf{x}, \mathbf{x}') = \sigma_\nu^2 \exp\left( -\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{\lambda_d^2} \right) \qquad (15)$$

with automatic relevance determination (ARD). Indeed, the lengthscales
$\lambda_d$ can grow to eliminate the contribution of any irrelevant input
dimension.

Table 1. Test error for bundle adjustment and Gaussian Process Regression with
various kernels, computed on a set of 792 data items. Root mean squared error of the
spatial residua was used as error measure.

Method                              Test Error   Preprocessing
Bundle adjustment                   0.38 mm      -
Inhomogeneous polynomial (g = 4)    0.29 mm      scaled input
Inhomogeneous polynomial (g = 4)    0.28 mm      transformed, scaled input
Squared exponential                 0.31 mm      scaled input
Squared exponential                 0.27 mm      transformed, scaled input

The parameters $\sigma^2$, $\sigma_\nu^2$ and $g$ of the polynomial covariance
function were estimated by maximizing the GPP criterion [4]. The parameters
$\sigma^2$, $\sigma_\nu^2$ and the $\lambda_d$ of the squared exponential kernel
were estimated by maximizing their marginal log likelihood [5]. In both cases,
we used the conjugate gradient algorithm as optimization method.
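For concreteness, the two covariance functions can be written as matrix-valued functions of two input sets. This is a sketch assuming the standard forms of the inhomogeneous polynomial kernel and of the squared exponential kernel with one lengthscale per input dimension; the function and parameter names (`poly_kernel`, `se_ard_kernel`, `sigma_nu2`, `lengthscales`) are illustrative, not taken from the paper.

```python
import numpy as np

def poly_kernel(X1, X2, sigma_nu2=1.0, g=4):
    """Inhomogeneous polynomial kernel of degree g; X1 is (n1, D), X2 is (n2, D)."""
    return sigma_nu2 * (X1 @ X2.T + 1.0) ** g

def se_ard_kernel(X1, X2, sigma_nu2=1.0, lengthscales=1.0):
    """Squared exponential kernel with one lengthscale per input dimension (ARD)."""
    lam = np.broadcast_to(np.asarray(lengthscales, dtype=float), (X1.shape[1],))
    diff = (X1[:, None, :] - X2[None, :, :]) / lam      # scaled pairwise differences
    return sigma_nu2 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
```

Setting one lengthscale to a very large value makes the kernel insensitive to that dimension, which is the mechanism by which ARD can switch off irrelevant inputs.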

We used two different types of preprocessing in the experiments: 1. Scaling
each dimension of the input data to the interval $[-1, 1]$; 2. Transforming the
input data according to

$$(x_1, y_1, x_2, y_2) \mapsto (0.5(x_1 - x_2),\ 0.5(x_1 + x_2),\ 0.5(y_1 - y_2),\ 0.5(y_1 + y_2)).$$

The output data was centered for training.
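Both preprocessing steps are simple array operations. The sketch below assumes the input rows are ordered as $(x_1, y_1, x_2, y_2)$ and that, for the "transformed, scaled" variant, the transform is applied before the scaling; that ordering and the function names are assumptions made for illustration.

```python
import numpy as np

def transform_stereo(X):
    """Map rows (x1, y1, x2, y2) to half-differences and half-sums of the coordinates."""
    x1, y1, x2, y2 = X.T
    return np.column_stack([0.5 * (x1 - x2), 0.5 * (x1 + x2),
                            0.5 * (y1 - y2), 0.5 * (y1 + y2)])

def scale_to_unit_interval(X, lo=None, hi=None):
    """Scale each column to [-1, 1]; lo and hi default to the column extremes of X."""
    lo = X.min(axis=0) if lo is None else lo
    hi = X.max(axis=0) if hi is None else hi
    return 2.0 * (X - lo) / (hi - lo) - 1.0, lo, hi
```

Returning `lo` and `hi` lets the scaling learned on the training set be reused unchanged on test inputs.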



5 Results 

The cross-validation for the camera model yielded $\sigma_X = 2$ mm as best a
priori estimation for the standard deviation of the spatial coordinates. In the
same way, a maximal degree of $t = 3$ for the Chebychev polynomials was found to
be optimal for the estimation of the lens distortion. Table 1 shows the test
errors of the different algorithms and preprocessing methods.

All algorithms achieved error values under one millimeter. Gaussian Process
regression with both kernels showed a superior performance to the classical
approach. Fig. 2 shows the position error according to the test points' actual
depth and according to the image coordinates' distance to the lens center, the
so-called eccentricity. One can see that the depth error increases nonlinearly
with increasing spatial distance to the camera. Calculation of errors shows that
the depth error grows quadratically with the image position error, so this
behaviour is expected and indicates the sanity of the learned model. Another
hint that all of the used algorithms are able to model the lens distortions is
the absence of a trend in the right figure. Again, the learning algorithms do
better and show a smaller error for almost all eccentricities.

The superiority of the squared exponential kernel to the polynomial can be
explained by its ability to assign different length scales to different
dimensions of the data and therefore set higher weights on more important
dimensions. In our experiments $\lambda_{y_1}$ and $\lambda_{y_2}$ were always
approximately five times larger than $\lambda_{x_1}$ and $\lambda_{x_2}$, which
is consistent with the underlying physical process, where the depth of a point
is computed by the disparity in the $x$-direction of the image coordinates.
The same phenomenon could be observed for the transformed inputs, where
higher weights were assigned to $x_1$ and $x_2$.

6 Discussion 

We applied Gaussian Process Regression to the problem of estimating the spatial 
position of a point from its coordinates in two different images and compared its 
performance to the classical camera calibration. Our results show that the generic 
learning algorithms performed better although maximal physical knowledge was 
used in the explicit stereo camera modelling. 

As both approaches are able to model the depth estimation very precisely, 
the different error rates can be explained by their different abilities to account for 
lens distortions. The flexible parameterization of Gaussian Processes allows for a 
spatially detailed modeling of lens distortion. In contrast, a significant estimation 
of the higher-order Chebychev polynomials capable of modeling strongly space- 
variant distortion fields turned out to be impossible. A further reason might lie 
in the sensitivity of bundle adjustment to parameter initialization since there is 
no obvious a priori choice for the initial values of the Chebychev coefficients in 
most cases. 

An additional advantage of our approach is the mechanical and therefore
simple way of model selection, while the correct parametrization of a camera
model and elimination of correlating terms is a painful and tedious procedure.
Moreover, the convergence of the regression process does not depend on good
starting values like the estimation of the camera model's parameters does.

Fig. 2. Position error depending on the actual depth of the test point (left figure) and
on the distance to the lens center, the so-called eccentricity (right figure).

A disadvantage of the machine learning approach is that it does not give 
meaningful parameters such as position and orientation in space or the camera’s 
focal length. Moreover, it does not take into account situations where the exact 
spatial positions of the training examples are unknown, whereas classical camera 
calibration allows for an improvement of the spatial position in the training 
process. 

The time complexity for all algorithms is $O(m^3)$ for training and $O(n)$ for
the computation of the predictions, where $m$ denotes the number of training
examples and $n$ the number of test examples. In both training procedures,
matrices with a size in the order of the number of training examples have to
be inverted at each iteration step. So the actual time needed also depends on
the number of iteration steps, which scales with the number of parameters and
can be assumed constant for this application. Without improving the spatial
coordinates, the time complexity for the training of the camera model would
be $O(p^3)$, where $p$ denotes the number of parameters. But since we are also
updating the spatial observations, the number of parameters is upper bounded
by a multiple of the number of training examples such that the matrix inversion
in (7) is in $O(m^3)$. An additional advantage of GP is the amount of time
actually needed for computing the predictions. Although predicting new spatial
points is in $O(n)$ for GP and the camera model, predictions with the camera
model always consume more time. This is due to the improvements of the initial
prediction with a linear estimator, which again is an iterative procedure
involving an inversion of a matrix of constant size at each step (cf. end of
Sect. 2).
Abstract. The recent development of graph kernel functions has made 
it possible to apply well-established machine learning methods to graphs. 
However, to allow for analyses that yield a graph as a result, it is nec- 
essary to solve the so-called pre-image problem: to reconstruct a graph 
from its feature space representation induced by the kernel. Here, we 
suggest a practical solution to this problem. 



1 Introduction 

Many successful classical machine learning methods are originally linear and 
operate on vectorial data, e.g. PCA, k-means clustering and SVM. As these al- 
gorithms can be expressed in terms of dot products only, their scope is extended 
by kernelization [8]: the dot products are replaced by a kernel function, which 
implicitly maps input data into a vectorial space (the feature space) . Kerneliza- 
tion not only makes the algorithms effectively non-linear (in input space), but 
also allows them to work on structured data. For example, kernels on strings 
and graphs were proposed in [3], leading to text classification systems [5] or a 
chemical compound classification system [7], where every chemical compound is 
considered as an undirected graph. 

Once a suitable kernel function is developed, the kernel approach works fine
for algorithms with numerical output (like regression, classification or
clustering). If the output shall be a point in the space of the original
(structured) data, additionally the pre-image problem has to be solved: a point
in the feature space must be mapped back to the input space. For example,
outputs in input space are desired when averages of input data are computed or
elements optimizing some criteria are sought; other applications are described
in [1].

The pre-image problem can be formalized as follows. We are given a positive
definite kernel function $k$ which computes dot products of pairs of members
$g, g'$ of an input space $\mathcal{G}$. The kernel induces some Reproducing
Kernel Hilbert Space (RKHS) $\mathcal{F}$, called the feature space, and a
mapping $\phi : \mathcal{G} \to \mathcal{F}$, such that
$k(g, g') = \langle \phi(g), \phi(g') \rangle$ [8]. Finally, we are given the
feature space representation $\psi = \phi(g^*)$ of a desired output $g^*$. Then
the pre-image problem consists of finding $g^*$. Note that this is often not
possible, since $\mathcal{F}$ is usually a far larger space than $\mathcal{G}$
(see Figure 1). In these cases, the (approximate) pre-image $g^*$ is chosen such
that the squared distance of $\psi$ and $\phi(g^*)$ is minimized,

$$g^* = \arg\min_{g} \|\psi - \phi(g)\|^2. \qquad (1)$$

Fig. 1. Searching for pre-images $g^*$ of a point $\psi$ given in feature space
$\mathcal{F}$ corresponds to finding the nearest point $\phi(g^*)$ on a
nonlinear manifold.

A general learning-based framework for finding pre-images is described in [1].
The pre-image problem is particularly challenging with input spaces of discrete
structures like strings or graphs, because it is difficult to apply gradient-based
optimization to combinatorial problems. For strings, it is often possible to take
advantage of their linear structure, for example by dynamic programming; a
different, but also incremental approach is presented in [1].

In this paper we consider the pre-image problem for graphs. More precisely,
we take the input space $\mathcal{G}$ to be the set of node-labeled graphs, and
assume the kernel $k$ to be the graph kernel described in [7]. For this setting,
we propose an algorithm to approximate the pre-image $g^* \in \mathcal{G}$ of a
point $\psi \in \mathcal{F}$. To our knowledge, no such algorithm has been
published before. In contrast to the computation of intermediate graphs [4], our
approach also allows for extrapolation. Furthermore, it can be directly used
with a large number of established machine learning algorithms.

There are many potential applications of pre-images of graphs. One exam- 
ple would be the reconstruction of finite automata or the regular languages 
represented by them. Probably the most intriguing scenario is the synthesis of 
molecules, the 2D representations of which are essentially labeled, undirected 
graphs. Since molecules with appropriate properties serve, for example, as
chemical materials or as drugs, their directed design is very attractive.
Correspondingly, some work on the de novo design of drugs has been done. One
strategy, pursued in [6], is to focus on linear chain molecules (like RNA and peptides). 
This makes the problem much easier, but restricts applicability. A rather gen- 
eral genetic algorithm for molecular design is presented in [9]. While it could, in 
principle, also be used to optimize our criterion (1), it does not take advantage 
of the structure of the RKHS implied by the kernel function. Thus, our approach 
is probably better suited for use in combination with kernel methods that rely 
on this kernel. 

In the following section, we briefly review the kernel used for labeled graphs
and its induced feature space geometry. Thereafter, we present the main
contribution: a proposed method to approximate pre-images for undirected graphs.
In the experimental section, we demonstrate its feasibility with a linear
interpolation between two molecules represented as graphs. We conclude by
discussing weaknesses and further directions.



2 Comparing Graphs: The Marginalized Graph Kernel 



In the following we consider node-labeled, undirected graphs $g = (v, E)$, where
the nodes are identified with the set $V = \{1, \ldots, |g|\}$, $|g|$ is the
order of the graph, the function $v : V \to \Sigma$ supplies the labels of the
nodes which are taken from some finite set $\Sigma$, and
$E \in \{0, 1\}^{|g| \times |g|}$ is the adjacency matrix, i.e. $E(i, j) = 1$
iff there is an edge between nodes $i$ and $j$. $\mathcal{G}$ denotes the set of
all possible graphs.

We briefly review the marginalized graph kernel introduced in [7], basically
following the notation used there while omitting edge labels. Let $\Omega(g)$
denote the set of all possible node paths $\mathbf{h}$ on a graph $g = (v, E)$,
i.e. $\mathbf{h} \in V^n$ satisfies $E(h_i, h_{i+1}) = 1$ for every $i < n$;
$|\mathbf{h}| := n$ is the path length. We define a probability distribution
$p(\mathbf{h}|g)$ over $\Omega(g)$ by considering random walks with start
probability $p_s(i)$ at node $i$, a constant termination probability
$p_e(i) = \lambda$ and a transition probability $p_t(i, j)$ which is positive
only for edges, i.e. if $E(i, j) = 1$. Therefore the posterior probability for a
path $\mathbf{h}$ is described as

$$p(\mathbf{h}|g) = p_s(h_1) \prod_{i=2}^{|\mathbf{h}|} p_t(h_i | h_{i-1})\; p_e(h_{|\mathbf{h}|}). \qquad (2)$$

Now any kernel $k_h$ on node paths induces a kernel on graphs via

$$k(g, g') = \mathrm{E}_{\mathbf{h}, \mathbf{h}'}[k_h(\mathbf{h}, \mathbf{h}')] = \sum_{\mathbf{h} \in \Omega(g)} \sum_{\mathbf{h}' \in \Omega(g')} k_h(\mathbf{h}, \mathbf{h}')\, p(\mathbf{h}|g)\, p(\mathbf{h}'|g'). \qquad (3)$$

In the following, we will only consider the matching kernel (or $\delta$-kernel)
on the labels of the nodes visited by the paths:

$$k_h(\mathbf{h}, \mathbf{h}') = \begin{cases} 1 & \text{if } |\mathbf{h}| = |\mathbf{h}'| \wedge \bigwedge_{i=1}^{|\mathbf{h}|} v_{h_i} = v'_{h'_i} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

For the efficient computation of the graph kernel, it is convenient to define the
following matrices:

$$\begin{aligned} s(h_1, h'_1) &= \delta(v_{h_1}, v'_{h'_1})\, p_s(h_1)\, p'_s(h'_1), \\ T(h_k, h'_k, h_{k-1}, h'_{k-1}) &= \delta(v_{h_k}, v'_{h'_k})\, p_t(h_k | h_{k-1})\, p'_t(h'_k | h'_{k-1}), \\ q(h_i, h'_j) &= p_e(h_i)\, p'_e(h'_j). \end{aligned} \qquad (5)$$

Substituting (2), (4) and (5) into (3), one can derive a compact form of the graph
kernel (see [7] for details):

$$k(g, g') = s(g, g')^T (I - T(g, g'))^{-1} q(g, g'). \qquad (6)$$
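The compact form (6) translates directly into a small linear solve over pairs of nodes. The sketch below is an illustration under assumptions that the text does not fix at this point: start probabilities are uniform, the termination probability is the constant $\lambda$ at every node, and the remaining mass $1 - \lambda$ is spread uniformly over a node's neighbors. The function name `marginalized_graph_kernel` is hypothetical.

```python
import numpy as np

def marginalized_graph_kernel(labels1, adj1, labels2, adj2, lam=0.03):
    """Evaluate k(g, g') = s^T (I - T)^{-1} q for two node-labeled graphs."""
    labels1, labels2 = np.asarray(labels1), np.asarray(labels2)
    A1, A2 = np.asarray(adj1, dtype=float), np.asarray(adj2, dtype=float)
    n1, n2 = len(labels1), len(labels2)

    def transitions(A):
        # P[i, j] = p_t(j | i): uniform over neighbors, scaled by (1 - lam);
        # rows of isolated nodes stay zero.
        deg = A.sum(axis=1, keepdims=True)
        return np.divide((1.0 - lam) * A, deg, out=np.zeros_like(A), where=deg > 0)

    P1, P2 = transitions(A1), transitions(A2)
    # match[i, i'] = 1 if node i of g and node i' of g' carry the same label
    match = (labels1[:, None] == labels2[None, :]).astype(float)
    s = (match * np.full((n1, n2), 1.0 / (n1 * n2))).ravel()   # start term
    q = np.full(n1 * n2, lam * lam)                            # termination term
    # T[(i, i'), (j, j')] = delta(v_j, v'_j') * p_t(j | i) * p'_t(j' | i')
    T = np.kron(P1, P2) * match.ravel()[None, :]
    return s @ np.linalg.solve(np.eye(n1 * n2) - T, q)
```

The system has one unknown per node pair, so the cost grows with the cube of $|g| \cdot |g'|$; this is adequate for small molecules but would need a sparse or iterative solver for larger graphs.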
3 Pre-images for Undirected Graphs

In the following we discuss an approximate solution to the problem stated
in equation (1), that is we reconstruct the graph $g^*$ given its representation
$\psi$ in feature space $\mathcal{F}$, such that $\phi(g^*)$ is as close as
possible to $\psi$. Note that solving problem (1) requires determining the
vertex label set $\Sigma^* \in g^*$ and the adjacency matrix $E^* \in g^*$. Since
joint optimization over the discrete set $\Sigma^*$ and the adjacency matrix
$E^*$ is not feasible, we introduce a sequential approach to solve (1)
approximately. This is quite natural, since the structure of the adjacency
matrix depends on the size and the labels of the label set $\Sigma^*$.

We assume in the following that the point $\psi$ is expressed as a linear
combination of some given points $\{\phi(g_1), \ldots, \phi(g_N)\}$ with
$g_1, \ldots, g_N \in D_N$, that is

$$\sum_{i=1}^{N} \alpha_i\, \phi(g_i) = \psi. \qquad (7)$$

This assumption is always satisfied in the case of kernel methods. In the
following we will use relation (7) of the given data $D_N$ to $\psi$ and
properties of the marginalized graph kernel to infer $\Sigma^*$ and $E^*$.

3.1 Determining the Order of the Graph 

It may come as a surprise that the kernel values do not contain information 
about the graph sizes. To see why this is the case, consider an easy example. Fix 
an arbitrary graph g and its kernel values with other graphs. Now consider the 
graph 2g which is defined by duplicating g, i.e. it consists of two unconnected 
copies of g. Thus, the start probabilities (for each copy of each node) are divided 
by two, while the transition and termination probabilities remain unchanged. 
Thus, for each of the two copies, the feature space representation (the histogram 
of label paths) is also divided by two. Adding the histograms of the two copies 
(which corresponds to using the combined graph) recovers the original feature 
space representation. Therefore, all kernel values have to be the same. 

Since the kernel does not help in determining the size, we have to make use 
of heuristics. Below, we consider three simple ideas to fix m* := |g*|, the size 
of g*: (i) linear combination of the input graph sizes; (ii) exhaustive search in a 
range; and (iii) learning (regression). 

An intuitive idea is to determine $m^*$ as a linear combination of the input
graph sizes. It is natural to give each input graph the same amount
($\alpha_i$) of influence on the size of $g^*$ as it has on $g^*$. Thus we have
the weighted average

$$m^* = \sum_{i=1}^{N} \alpha_i\, m_i,$$

where $m_i$ is the size of graph $g_i$.

For exhaustive search we need to restrict ourselves to a finite range of
plausible sizes, for example between the minimum and maximum size of the input
graphs. Then a tentative graph is reconstructed for each size in the range, and
the one best approximating $\psi$ is chosen afterwards.

The most sophisticated approach is to learn the graph size. Since we have
example graphs given in $D_N$, we can state the order estimation as a regression
problem: Using the kernel map, we explicitly construct features $\mathbf{x}_i$,
$1 \le i \le N$, by
$\mathbf{x}_i = (k(g_i, g_1), \ldots, k(g_i, g_N)) \in \mathbb{R}^N$. Letting
$m_i = |g_i|$, we can apply any regression method to estimate their dependence
on the $\mathbf{x}_i$. Learning the graph order has the advantage that it does
not use any knowledge of the graph kernel itself, but has the disadvantage of
being the computationally most expensive approach. Once the order $|g^*|$ is
estimated, we are able to determine the vertices in $\Sigma^* \in g^*$.



3.2 Determining the Vertex Set 

The vertex set $\Sigma^*$ contains the labels of vertices in the graph $g^*$.
Determining $\Sigma^*$ requires deciding whether a label $v_i$ is a member of
$\Sigma^*$, and if so, how many times it appears in the graph. To answer these
two questions, we will use special properties of the marginalized graph kernel.
We introduce the trivial graph, $g_{v_i}$, which consists just of one vertex
$v_i \in \Sigma$ and zero edges. A random walk on the trivial graph $g_{v_i}$
creates a single path of length 1 consisting of the single vertex label $v_i$
itself. Reconsidering the terms appearing in the graph kernel, one sees that the
only nonzero terms are

$$s(h_1 = v_i, h'_1 = v_i) = p_s(h'_1 = v_i),$$

$$q(h_1 = v_i, h'_1 = v_i) = p_e(h_1 = v_i)\, p_e(h'_1 = v_i),$$

while $T \in \mathbb{R}^{|g| \times |g|}$ becomes a zero matrix. Assuming a
constant termination probability $p_e(v_k) = \lambda$ and uniform start
probabilities $p_s(h_1 = v_k) = 1/m_i$, the evaluation of the graph kernel yields

$$k(g_{v_k}, g_i) = m_{ik} \cdot p_s(h_1 = v_k) \cdot p_e^2(v_k) = m_{ik} \cdot \lambda^2 / m_i,$$

$$k(g_{v_k}, g^*) = m^*_k \cdot p_s(h^*_1 = v_k) \cdot p_e^2(v_k) = m^*_k \cdot \lambda^2 / m^*,$$

where $m_{ik}$ and $m^*_k$ denote the numbers of occurrences of the label $v_k$
in the graphs $g_i$ and $g^*$, respectively.

We are now able to find the vertex set by solving for $m^*_k$:

$$m^*_k = k(g_{v_k}, g^*) \cdot \frac{m^*}{\lambda^2} = \frac{m^*}{\lambda^2} \sum_{i=1}^{N} \alpha_i \lambda^2 \frac{m_{ik}}{m_i} = m^* \sum_{i=1}^{N} \alpha_i \frac{m_{ik}}{m_i}. \qquad (8)$$

The last equality shows that the fractions $m^*_k / m^*$ of labels in $g^*$ are
just the linear combinations of the fractions in the input graphs. This can be
combined with any preset order $m^*$ of $g^*$.
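The final expression for $m^*_k$ only needs the label counts of the input graphs and the coefficients $\alpha_i$. A minimal sketch, assuming the counts are stored in a matrix whose rows sum to the graph sizes (so every label of the alphabet has a column); the name `label_counts` is hypothetical.

```python
import numpy as np

def label_counts(alpha, counts, m_star):
    """Return m*_k = m* * sum_i alpha_i * m_ik / m_i for every label k.

    alpha  : (N,) expansion coefficients
    counts : (N, K) matrix with counts[i, k] = m_ik
    m_star : chosen order of the pre-image graph
    """
    alpha = np.asarray(alpha, dtype=float)
    counts = np.asarray(counts, dtype=float)
    sizes = counts.sum(axis=1)              # m_i, assuming all labels are counted
    fractions = counts / sizes[:, None]     # m_ik / m_i
    return m_star * (alpha @ fractions)
```

The result is real-valued in general. Turning it into integer multiplicities that sum to $m^*$ needs a rounding rule, which the text does not specify and which is therefore left out here.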
3.3 Estimating the Adjacency Matrix

As a final step, we have to determine the adjacency matrix $E^*$. Since the
adjacency matrix consists of boolean variables, and our objective function is
not linear, the optimal solution would require solving a non-convex non-linear
combinatorial optimization problem. We therefore propose a stochastic search
approach.

We define the distance of a graph $g$ to the point $\psi$ as the distance of the
image $\phi(g)$ to $\psi$, the square of which can be calculated by

$$d^2(g, g^*) = \|\phi(g) - \psi\|^2 = k(g, g) - 2 \sum_{i=1}^{N} \alpha_i\, k(g, g_i) + \sum_{i,j=1}^{N} \alpha_i \alpha_j\, k(g_i, g_j).$$
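Because the squared distance involves only kernel evaluations, it can be computed without ever forming feature vectors. A small sketch with precomputed kernel values; the function and argument names are illustrative.

```python
import numpy as np

def squared_distance(k_gg, k_g, alpha, K):
    """||phi(g) - psi||^2 for psi = sum_i alpha_i phi(g_i).

    k_gg  : scalar k(g, g)
    k_g   : (N,) vector with entries k(g, g_i)
    alpha : (N,) expansion coefficients
    K     : (N, N) kernel matrix with entries k(g_i, g_j)
    """
    alpha = np.asarray(alpha, dtype=float)
    return k_gg - 2.0 * (alpha @ k_g) + alpha @ K @ alpha
```

The last term does not depend on $g$, so when many candidate graphs are compared against the same $\psi$ it can be computed once and reused.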

Let $D_k = \{g_1, \ldots, g_k\} \subset D_N$ be the set of $k$ nearest neighbors
to $\psi$. Starting from that set, we construct a Markov chain
$g_0, g_1, \ldots$, that is
$P(g_{i+1} \mid g_0, g_1, \ldots, g_i, D_k) = P(g_{i+1} \mid g_i, D_k)$, with the
property that

$$d(g_{i+1}, g^*) < d(g_i, g^*)$$

and $g_i$ being the current state of the Markov chain. New proposal states are
created by sampling from the vicinity of graphs in $D_k \cup \{g_i\}$ and are
rejected if their distance to $\psi$ is larger than that of the current state.

Before we describe the sampling strategy in detail, we define the spread
$\chi(D_s)$ of an arbitrary set $D_s = \{g_1, \ldots, g_s\}$ as the sample
variance of the feature coordinates
$\{k_{D_N}(g_1), \ldots, k_{D_N}(g_s)\}$ with
$k_{D_N}(g) = (k(g, g_1), \ldots, k(g, g_N))^T$ and
$g_1, \ldots, g_N \in D_N$.

Now that we have defined spread, we can propose a sampling strategy. For
each graph $g_s \in D_k \cup \{g_i\}$, we generate a set $G_{g_s}$ of $l$ new
graphs $g_s^t \in G_{g_s}$, $1 \le t \le l$, by randomly inserting and deleting
edges in $g_s$. Since the spread $\chi(G_{g_s})$ corresponds to the size of the
sampled area in $\mathcal{G}$, we require the distance $d(g_s, g^*)$ to be
proportional to $\chi(G_{g_s})$. That is, the further the graph $g_s$ is from
$g^*$, the more diverse the generated set $G_{g_s}$ will be. This can be achieved
by using a monotone function $f : \mathbb{R} \to \mathbb{N}$, which maps the
distance $d(g_s, g^*)$ to the number of random modifications. For example, the
natural logarithm with a proper scaling of the distance can be used:
$f(d(g_s, g^*)) = \lceil \log(a \cdot d(g_s, g^*)) \rceil$.
We are now ready to formulate the stochastic search for the adjacency matrix
$E^*$ of $g^*$.

Algorithm 1

Let $g_0 = \arg\min_{g \in D_N} \|\phi(g) - \psi\|$ be the nearest neighbor of $\psi$ in $D_N$.
Let $D_k$ be the set consisting of the $k$ nearest neighbors of $\psi$ in $D_N$.
Let $i = 0$, $r = 0$.

While $r < r_{\max}$:

 - Generate $l$ new proposal graphs $\{g_s^t\}_{t=1}^{l} =: G_{g_s}$ for each $g_s \in D_k \cup \{g_i\}$
   by adding and deleting $f(d(g_s, g^*))$ edges.

 - $g_{\mathrm{new}} = \arg\min_{g \in \bigcup_s G_{g_s}} \|\phi(g) - \psi\|$.

 - If $d(g_{\mathrm{new}}, g^*) < d(g_i, g^*)$:
       $g_{i+1} = g_{\mathrm{new}}$, $i = i + 1$, $r = 0$
   else:
       $r = r + 1$



The stochastic search stops once it has failed $r_{\max}$ times in a row to find a graph closer to $g^*$.
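The search loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: graphs are unlabeled symmetric 0/1 adjacency matrices, the kernel is a toy inner product of adjacency matrices (a stand-in for the marginalized graph kernel), and the names `dist2`, `flip_edges`, `n_modifications` and `stochastic_search` as well as the clamp to at least one modification are hypothetical choices made for the example.

```python
import math
import numpy as np

def linear_kernel(a, b):
    # Toy stand-in kernel (assumption): inner product of adjacency matrices.
    return float(np.sum(a * b))

def dist2(g, graphs, alpha, k=linear_kernel):
    """Squared distance ||phi(g) - psi||^2 with psi = sum_i alpha_i phi(g_i)."""
    cross = sum(a * k(g, gi) for a, gi in zip(alpha, graphs))
    const = sum(ai * aj * k(gi, gj)
                for ai, gi in zip(alpha, graphs)
                for aj, gj in zip(alpha, graphs))
    return k(g, g) - 2.0 * cross + const

def flip_edges(g, n_flips, rng):
    """Insert or delete n_flips randomly chosen edges, keeping g symmetric."""
    h = g.copy()
    n = h.shape[0]
    for _ in range(n_flips):
        u, v = rng.choice(n, size=2, replace=False)
        h[u, v] = h[v, u] = 1 - h[u, v]
    return h

def n_modifications(d, scale=10.0):
    # ceil(log(scale * d)); clamping to >= 1 is an assumption of this sketch.
    return max(1, math.ceil(math.log(scale * d))) if d > 0 else 1

def stochastic_search(graphs, alpha, k_nn=3, n_prop=50, r_max=10, seed=0):
    rng = np.random.default_rng(seed)
    d = lambda g: dist2(g, graphs, alpha)
    neighbours = sorted(graphs, key=d)[:k_nn]   # k nearest neighbours of psi
    current, r = neighbours[0], 0               # g_0 is the nearest neighbour
    while r < r_max:
        proposals = []
        for gs in neighbours + [current]:
            m = n_modifications(math.sqrt(max(d(gs), 0.0)))
            proposals += [flip_edges(gs, m, rng) for _ in range(n_prop)]
        best = min(proposals, key=d)
        if d(best) < d(current):                # accept only strict improvements
            current, r = best, 0
        else:
            r += 1
    return current
```

Because a proposal is accepted only when it is strictly closer to $\psi$, the returned graph is never further from $\psi$ than the nearest neighbor in the database.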
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4 Experiment: Graph Interpolation 

In this section, we will demonstrate the introduced technique to interpolate 
between two randomly selected graphs of the Mutag dataset [2] consisting of 230 
molecules (aromatic and heteroaromatic nitro compounds). The vertex labels 
identify the atom types and edge labels indicate the bonding type. For our 
experiment we ignore the edge label information and just set the adjacency 
entries to 1 if there exists any type of bond between two atoms (0 otherwise). 

For the graph kernel, we use the same parameter $\lambda = 0.03$ as in [7]. We randomly
select two molecules, which we denote by $g_1, g_2$, and set $\psi = \alpha\phi(g_1) + (1 - \alpha)\phi(g_2)$.
We run Algorithm 1 for every $\alpha \in \{0, 0.1, \ldots, 1\}$, using $k = 5$
nearest neighbors, $r_{\max} = 10$ and $l = 500$. A series of
resulting graphs is shown in Figure 2. The quality of the computed pre-images
is compared to the nearest neighbor approach in Figure 3. The pre-images have
much smaller distances to $\psi$ than the nearest neighbor in the database.




Fig. 2. Pre-images for interpolated graphs (using the marginalized graph kernel) found 
by stochastic search by Algorithm 1. 

5 Discussion and Outlook 

We presented a pre-image technique for graphs. The stochastic search approach 
is able to synthesize graph pre-images, if there are examples in the vicinity of 
the pre-image. However, where no examples are provided in the region of the 
pre-image, the quality of the found pre-image also decreases (i.e., its distance 
increases). This phenomenon is illustrated as the correlation between the two 
measures shown in Figure 3. A likely explanation is that the generation of graphs 
(the proposal states in Section 3.3) fails to sample the space in the direction 
towards $\psi$. We plan to improve the quality of the stochastic search method by
using a priori knowledge of the pre-image to be found, and also to explore other
stochastic optimization techniques.

Fig. 3. The distance of the reconstructed graph $\phi(g_i)$ to $\alpha\phi(g_1) + (1 - \alpha)\phi(g_2)$ vs. the distance of the nearest neighbor in $D_N$. The maximum reconstruction error was used for normalization.

Although chemical engineering provides a strong motivation for our work,
we point out that this paper deals with pre-images of general graphs, and that
almost no use is made of the special properties of molecular graphs. As an important
example, the number of incident bonds (edges) must agree with the atom
type (node label). Such properties will have to be taken into account for serious
applications to chemical problems. However, this paper takes an important
first step towards molecular engineering within the powerful RKHS framework,
which allows many established algorithms to be used, and may provide a basis for
many other applications.
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Abstract. We introduce a learning technique for regression between high- 
dimensional spaces. Standard methods typically reduce this task to many one- 
dimensional problems, with each output dimension considered independently. By 
contrast, in our approach the feature construction and the regression estimation 
are performed jointly, directly minimizing a loss function that we specify, sub- 
ject to a rank constraint. A major advantage of this approach is that the loss is no 
longer chosen according to the algorithmic requirements, but can be tailored to the 
characteristics of the task at hand; the features will then be optimal with respect 
to this objective, and dependence between the outputs can be exploited. 



1 Introduction 

The problem of regressing between a high dimensional input space and a continuous,
univariate output has been studied in considerable detail; classical methods are described
in [5], and methods applicable when the input is in a reproducing kernel Hilbert space
are discussed in [8]. When the output dimension is high (or even infinite), however, it
becomes inefficient or impractical to apply univariate methods separately to each of the
outputs, and specialized multivariate techniques must be used.

We propose a novel method for regression between two spaces $\mathcal{F}_x$ and $\mathcal{F}_y$, where
both spaces can have arbitrarily large dimension. Our algorithm works by choosing low
dimensional subspaces in both $\mathcal{F}_x$ and $\mathcal{F}_y$ for each new set of observations made, and
finding the mapping between these subspaces for which a particular loss is small.$^1$ There
are several reasons for learning a mapping between low dimensional subspaces, rather
than between $\mathcal{F}_x$ and $\mathcal{F}_y$ in their entirety. First, $\mathcal{F}_x$ and $\mathcal{F}_y$ may have high dimension,
yet our data are generally confined to smaller subspaces. Second, the outputs may be
statistically dependent, and learning all of them at once allows us to exploit this
dependence. Third, it is common practice (for instance in principal component regression
(PCR)) to ignore certain directions in the input and/or output spaces, which decreases
the variance in the regression coefficients (at the expense of additional bias): this is a
form of regularization.

Given a particular subspace dimension, classical multivariate regression methods
use a variety of heuristics for subspace choice.$^2$ The mapping between subspaces is then
achieved as a second, independent step. By contrast, our method, Multivariate Regression
via Stiefel Constraints (MRS), jointly optimises over the subspaces and the mapping;
its goal is to find the subspace/mapping combination with the smallest possible loss.
Drawing on results from differential geometry [3], we represent each subspace projection
operator as an element on a Stiefel manifold. Our method then conducts gradient descent
over these projections.

1 The loss is specified by the user.

2 For instance, PCR generally retains the input directions with highest variance, whereas partial
least squares (PLS) approximates the input directions along which covariance with the outputs
is high.

We begin our discussion in Section 2 with some basic definitions, and give a formal
description of the multivariate regression setting. In Section 3, we introduce the MRS
procedure for the $L_2$ loss. Finally, in Section 4, we apply our method in estimating a
high dimensional image denoising transformation.



2 Problem Setting and Motivation 

We first describe our regression setting in more detail, and introduce the variables we will
use. We are given $m$ pairs of input and output variables, $z := ((x_1, y_1), \ldots, (x_m, y_m))$,
where $x_i \in \mathcal{F}_x$, $y_i \in \mathcal{F}_y$, and $\mathcal{F}_x$ and $\mathcal{F}_y$ are reproducing kernel Hilbert spaces$^3$ with
respective dimension $l_x$ and $l_y$. We write the matrices of centered observations as

$$X := [x_1 \ldots x_m] H, \qquad Y := [y_1 \ldots y_m] H,$$

where $H := I - \frac{1}{m} \mathbf{1}\mathbf{1}^\top$, and $\mathbf{1}$ is the $m \times 1$ matrix of ones.

We now specify our learning problem: given observations X and Y, and a loss
function $L(Y, X, F^{(r)})$, we want to find the best predictor $F^{(r)}$, defined as

$$F^{(r)} = \arg\min_{G \in \mathcal{H}^{(r)}} L(Y, X, G), \qquad (1)$$

where

$$\mathcal{H}^{(r)} := \{F \in \mathcal{F}_y^{\mathcal{F}_x} \mid \operatorname{rank} F = r\} \qquad (2)$$

and $\mathcal{F}_y^{\mathcal{F}_x}$ denotes the set of linear mappings from $\mathcal{F}_x$ to $\mathcal{F}_y$. This rank constraint is
crucial to our approach: it allows us to restrict ourselves to subspaces smaller than those
spanned by the input and/or output observations, which can reduce the variance in our
estimate of the mapping $F^{(r)}$ while increasing the bias. We select the rank that optimises
over this bias/variance tradeoff using cross validation.

As we shall see in Section 3, our approach is not confined to any particular loss
function. That said, in this study we address only the least squares loss function,

$$L_2(Y, X, F^{(r)}) = \|Y - F^{(r)} X\|_F^2, \qquad (3)$$

where the F subscript denotes the Frobenius norm.
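As a small illustration of the setting, the following sketch builds the centering matrix $H$, centers toy observation matrices, and evaluates the loss (3) for an arbitrary linear map. The data, the dimensions and the function names `centering_matrix` and `l2_loss` are assumptions made for the example only.

```python
import numpy as np

def centering_matrix(m):
    """H = I - (1/m) 1 1^T; right-multiplying by H removes the mean observation."""
    return np.eye(m) - np.ones((m, m)) / m

def l2_loss(Y, X, F):
    """Squared Frobenius norm ||Y - F X||_F^2, as in equation (3)."""
    return np.linalg.norm(Y - F @ X, "fro") ** 2

rng = np.random.default_rng(0)
m, lx, ly = 8, 4, 3                      # toy sizes (assumed)
X_raw = rng.normal(size=(lx, m)) + 5.0   # columns are observations
Y_raw = rng.normal(size=(ly, m)) - 2.0
H = centering_matrix(m)
X, Y = X_raw @ H, Y_raw @ H              # centered observation matrices
F = rng.normal(size=(ly, lx))            # an arbitrary linear map F_x -> F_y
print(l2_loss(Y, X, F))
```

Note that $H$ is idempotent, so centering twice has no further effect.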

3 A reproducing kernel Hilbert space is a Hilbert space $\mathcal{F}_x$ for which at each $x \in \mathcal{X}$, the point
evaluation functional $\delta_x : \mathcal{F}_x \to \mathbb{R}$, which maps $f \in \mathcal{F}_x$ to $f(x) \in \mathbb{R}$, is continuous.
To each reproducing kernel Hilbert space, there corresponds a unique positive definite kernel
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (the reproducing kernel), which constitutes the inner product on this space;
this is guaranteed by the Moore-Aronszajn theorem. See [7] for details.
We now transform the rank constraint in (1) and (2) into a form more amenable to
optimisation. By diagonalizing the predictor $F^{(r)}$ via its singular basis, we obtain

$$F^{(r)} = V^{(r)} S^{(r)} W^{(r)\top}, \qquad (4)$$

where

$$V^{(r)\top} V^{(r)} = I_{r,r}, \qquad (5)$$

$$W^{(r)\top} W^{(r)} = I_{r,r}, \qquad (6)$$

$$S^{(r)} \in \text{diagonal } \mathbb{R}^{r \times r}. \qquad (7)$$

In other words, $W^{(r)} \in S(l_y, r)$ and $V^{(r)} \in S(l_x, r)$, where $S(n, r)$ is called the Stiefel
manifold, and comprises the matrices with $n$ rows and $r$ orthonormal columns. In the
$L_2$ case, finding a rank constrained predictor (4) is thus equivalent to finding the triplet
$\theta = (V^{(r)}, S^{(r)}, W^{(r)})$ for which

$$\theta = \arg\min_{V^{(r)}, S^{(r)}, W^{(r)}} \|Y - V^{(r)} S^{(r)} W^{(r)\top} X\|_F^2, \qquad (8)$$

subject to constraints (5)-(7).$^4$ We will refer to $W^{(r)}$ and $V^{(r)}$ as feature matrices.
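The factorization (4) with constraints (5)-(7) can be illustrated with a truncated singular value decomposition. The sketch below is an assumption-laden example rather than part of the method: the helper name `rank_r_factors` and the toy matrix are hypothetical, and $V$ is taken to have $l_y$ rows and $W$ to have $l_x$ rows, which is the shape convention the product in (8) requires.

```python
import numpy as np

def rank_r_factors(F, r):
    """Return (V, S, W) with F_r = V S W^T of rank r, via the truncated SVD."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    return U[:, :r], np.diag(s[:r]), Vt[:r].T

rng = np.random.default_rng(0)
ly, lx, r = 5, 7, 2                      # toy dimensions (assumed)
F_full = rng.normal(size=(ly, lx))
V, S, W = rank_r_factors(F_full, r)
F_r = V @ S @ W.T                        # rank-r predictor in the form (4)
print(np.linalg.matrix_rank(F_r))
```

Both feature matrices returned this way have orthonormal columns, so they are points on Stiefel manifolds and can serve as a starting point for a constrained search.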

It is clear from (8) that the columns of $W^{(r)}$ and $V^{(r)}$ form a basis for particular
subspaces in $\mathcal{F}_x$ and $\mathcal{F}_y$ respectively, and that the regression procedure is a mapping
between these subspaces. A number of classical multivariate regression methods, such
as multivariate PCR and PLS, also have this property, although the associated subspaces
are not chosen to optimise a user specified loss function. In the next section, we introduce
an optimization technique - based on concepts borrowed from differential geometry -
to solve (8) directly.

3 Multivariate Regression via Stiefel Constraints 

In this section, we present a direct solution of the optimisation problem defined in (1) and
(2). We begin by noting that the alternative statement of the rank constraint (2), which
consists in writing the mapping $F^{(r)}$ in the form (4), still leaves us with a non-trivial
optimisation problem (8). To see this, let us consider an iterative approach to obtain an
approximate solution to (8), by constructing a sequence of predictors $F^{(r)}_1, \ldots, F^{(r)}_i$
such that

$$L(X, Y, F^{(r)}_i) \ge L(X, Y, F^{(r)}_{i+1}). \qquad (9)$$

We might think to obtain this sequence by updating $V_{i+1}$, $S_{i+1}$ and $W_{i+1}$ according
to their free matrix gradients $\frac{\partial L}{\partial V}\big|_{\theta_i}$, $\frac{\partial L}{\partial S}\big|_{\theta_i}$ and $\frac{\partial L}{\partial W}\big|_{\theta_i}$ respectively, where $\theta_i$ denotes
the solution $(V_i, S_i, W_i)$ at the $i$th iteration (i.e., the point at which the gradients are
evaluated). This is unsatisfactory, however, in that updating V and W linearly along
their free gradients does not result in matrices with orthogonal columns.

4 This is a more general form of the Procrustes problem [1] for which F( r ) is orthogonal rather 
than being rank constrained. 
Thus, to define a sequence along the lines of (9), we must first show how to optimise
over V and W in such a way as to retain orthogonal columns. As we saw in Section
2, the feature matrices are elements of the Stiefel manifold; thus any optimisation
procedure must take into account the geometrical structure of this manifold. The resulting
optimisation problem is non-convex, since the Stiefel manifold $S(n, r)$ is not a convex
set.

In the next subsection, we describe how to update V and W as we move along
geodesics on the Stiefel manifold $S(n, r)$. In the subsection that follows it, we use these
updates to conduct the minimisation of the $L_2$ loss.

3.1 Dynamics on Stiefel Manifolds 

We begin with a description of the geodesics for the simpler case of $S(n, n)$, followed
by a generalisation to $S(n, r)$ when $n > r$. Let $O(n)$ denote the group of orthogonal
matrices. Suppose we are given a matrix $V(t) \in O(n)$ that depends on a parameter
$t$, such that for $t$ in some interval $[t_a, t_b]$, $V(t)$ describes a geodesic on the manifold
$O(n)$. Our goal in this subsection is to describe how $V(t)$ changes as we move along
the geodesic. Since $O(n)$ is not only a manifold but also a Lie group (a special manifold
whose elements form a group), there is an elegant way of moving along geodesics which
involves an exponential map. We will give an informal but intuitive derivation of this
map; for a formal treatment, see [3,2]. We begin by describing a useful property of the
derivative of $V(t)$:

$$I = V(t)^\top V(t),$$

$$0 = \frac{d}{dt}\left(V(t)^\top V(t)\right),$$

$$0 = \left(\frac{d}{dt} V(t)\right)^\top V(t) + V(t)^\top \left(\frac{d}{dt} V(t)\right),$$

$$0 = Z(t)^\top + Z(t),$$

with

$$Z(t) := V(t)^\top \left(\frac{d}{dt} V(t)\right). \qquad (10)$$

It follows that $Z(t)$ is skew symmetric, which we write as $Z(t) \in \mathfrak{s}(n, n)$, where $\mathfrak{s}(n, n)$
consists of the set of all skew symmetric matrices of size $n \times n$. For particular curves
corresponding to 1-parameter subgroups of $O(n)$, we can show that $Z(t)$ is constant.
Thus (10) becomes an ordinary differential equation of the form

$$\frac{d}{dt} V(t) = V(t) Z \quad \text{with } V(0) = V, \qquad (11)$$

which has solution

$$V(t) = V(0) e^{tZ}, \qquad (12)$$




266 



G.H. Baku et al. 



where $e^Z$ denotes the matrix exponential [4].$^5$ We can see from (11) that the skew-symmetric
matrix $Z$ specifies a tangent at point $V(0)$ on the Stiefel manifold $S(n, n)$.
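The claim that (12) stays on $O(n)$ and solves (11) can be checked numerically. The following sketch is illustrative only: the helper `expm_skew` is a hypothetical name, and it computes the matrix exponential of a real skew-symmetric matrix through the eigendecomposition of the Hermitian matrix $iZ$ rather than through a general-purpose routine.

```python
import numpy as np

def expm_skew(Z, t=1.0):
    """e^{tZ} for real skew-symmetric Z, using that iZ is Hermitian."""
    lam, U = np.linalg.eigh(1j * Z)           # iZ = U diag(lam) U^H, lam real
    return (U @ np.diag(np.exp(-1j * t * lam)) @ U.conj().T).real

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
Z = A - A.T                                   # a skew-symmetric tangent direction
V0 = np.linalg.qr(rng.normal(size=(n, n)))[0] # a point on O(n)

def geodesic(t):
    """V(t) = V(0) e^{tZ}, equation (12)."""
    return V0 @ expm_skew(Z, t)

Vt = geodesic(0.7)
print(np.allclose(Vt.T @ Vt, np.eye(n)))      # V(t) remains orthogonal
```

A central finite difference of `geodesic` agrees with $V(t)Z$, which is the differential equation (11).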

We now generalize to the case where V does not have full rank, i.e. $V \in S(n, r)$ and
$r < n$. We can embed $S(n, r)$ into $O(n)$ by extending any $V \in S(n, r)$ with an $n \times (n - r)$
matrix $V_\perp \in S(n, n - r)$ such that $\mathbb{R}^n = V_\perp \oplus V$. Therefore $V_\perp$ spans the orthogonal
complement to the space spanned by the columns of V. Two orthogonal matrices A, B
in $O(n)$ are considered to be the same from the viewpoint of $S(n, r)$ if they relate as

$$B = [I_{n,r}, P] A \qquad (13)$$

for any matrix $P \in S(n, n - r)$, where $I_{n,r}$ contains the first $r$ columns of the $n \times n$
unit matrix. We can thus replace $V(t)$ by $[V(t), V_\perp(t)]$ in (11) (with a small abuse of
notation), to get

$$\frac{d}{dt} [V(t), V_\perp(t)] \Big|_{t=0} = [V(0), V_\perp(0)] Z \qquad (14)$$

and

$$V(t) = [V(0), V_\perp(0)]\, e^{tZ}\, [I_{n,r}, 0]. \qquad (15)$$

For the gradient descent, we need to find the tangent direction $\hat{G}$ (which replaces the
tangent direction $Z$ in (12)) that is as close as possible to the free gradient $G$, since
$G$ does not generally have the factorization (11). The constrained gradient $\hat{G}$ can be
calculated directly by projecting the free gradient onto the tangent space of V; see [3].
Intuitively speaking, we do this by removing the symmetric part of $G^\top V$, leaving the
skew symmetric remainder.$^6$ The constrained gradient is thus

$$\hat{G} = G - V G^\top V. \qquad (16)$$

Finally, the skew symmetric matrix $Z \in \mathbb{R}^{n \times n}$ is given by (10) as

$$Z = \begin{pmatrix} \hat{G}^\top V & -(\hat{G}^\top V_\perp)^\top \\ \hat{G}^\top V_\perp & 0 \end{pmatrix}. \qquad (17)$$



We can now describe the nonlinear update operator $\pi_{\mathrm{Stiefel}}$.

Algorithm 1 ($\pi_{\mathrm{Stiefel}}(V, G, \gamma)$) Given a free gradient $G \in \mathbb{R}^{n,r}$, a matrix $V \in S(n, r)$
with orthogonal columns, and a scalar step parameter $\gamma$, the update of V specified by
G and $\gamma$ can be calculated as follows:

1) Calculate constrained gradient $\hat{G}$ in (16).

2) Calculate basis $V_\perp$ for the orthogonal complement of V.

3) Calculate the tangent coordinates Z in (17).

4) $V(\gamma) = [V, V_\perp]\, e^{\gamma Z}\, [I_{n,r}, 0]$.
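The four steps of the update operator can be sketched as follows. This is a hedged illustration, not the authors' code: the names `stiefel_update` and `expm_skew` are hypothetical, the complement basis comes from a complete QR factorization, and the blocks of $Z$ are laid out so that the curve leaves $V$ with initial velocity $\hat{G}$. That block layout is one self-consistent choice derived from (10) and may differ in sign or transpose conventions from the printed form of (17).

```python
import numpy as np

def expm_skew(Z, t=1.0):
    """e^{tZ} for real skew-symmetric Z, via the Hermitian matrix iZ."""
    lam, U = np.linalg.eigh(1j * Z)
    return (U @ np.diag(np.exp(-1j * t * lam)) @ U.conj().T).real

def stiefel_update(V, G, t):
    """Move V in S(n, r) along a geodesic whose initial velocity is G_hat."""
    n, r = V.shape
    G_hat = G - V @ G.T @ V                    # 1) constrained gradient, eq. (16)
    Q = np.linalg.qr(V, mode="complete")[0]
    V_perp = Q[:, r:]                          # 2) basis of the orthogonal complement
    Z = np.zeros((n, n))                       # 3) skew-symmetric tangent coordinates
    Z[:r, :r] = V.T @ G_hat
    Z[r:, :r] = V_perp.T @ G_hat
    Z[:r, r:] = -Z[r:, :r].T
    full = np.hstack([V, V_perp]) @ expm_skew(Z, t)
    return full[:, :r]                         # 4) keep the first r columns
```

With this convention a descent step uses a negative `t` (or, equivalently, the negated gradient), and the result has orthonormal columns for any step size.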

5 When verifying that (12) is indeed a solution for (11), note that the skew-symmetric matrix Z 
is normal, i.e. ZZ T = Z T Z. 

6 Note that any square matrix can be expressed as a unique sum of a symmetric and a skew- 
symmetric matrix. 




Multivariate Regression via Stiefel Manifold Constraints 



267 



3.2 Multivariate Regression with L 2 Loss 

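Now that we have defined $\pi_{\mathrm{Stiefel}}$, we can apply a gradient descent approach to (8). We
first calculate the free gradients,

$$\frac{\partial L_2}{\partial V}\Big|_{\theta_i} = -Y X^\top W_i S_i, \qquad (18)$$

$$\frac{\partial L_2}{\partial W}\Big|_{\theta_i} = -X Y^\top V_i S_i + X X^\top W_i S_i^2, \qquad (19)$$

$$\frac{\partial L_2}{\partial S}\Big|_{\theta_i} = -I_{r,r} \odot W_i^\top X Y^\top V_i + I_{r,r} \odot W_i^\top X X^\top W_i S_i, \qquad (20)$$

where $\odot$ denotes the Hadamard (element-wise) product. The multivariate regression
algorithm for the $L_2$ loss is then: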

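Algorithm 2 MRS for $L_2$ loss function.

Initialization:

   $V_0 = I_{l_x,r}$, $S_0 = I_{r,r}$, $W_0 = I_{l_y,r}$,
   $\theta_0 = (V_0, S_0, W_0)$, $F^{(r)}_0 = W_0 S_0 V_0$, $i = 0$

Repeat until convergence:

1) Calculate free gradients, equations (18)-(20).

2) $t_V^*, t_S^*, t_W^* = \arg\min_{t_V, t_S, t_W} L_2(V(t_V), S(t_S), W(t_W))$
   with $W(t_W) = \pi_{\mathrm{Stiefel}}(W_i, \frac{\partial L_2}{\partial W}\big|_{\theta_i}, t_W)$,
   $V(t_V) = \pi_{\mathrm{Stiefel}}(V_i, \frac{\partial L_2}{\partial V}\big|_{\theta_i}, t_V)$,
   and $S(t_S) = S_i + t_S \frac{\partial L_2}{\partial S}\big|_{\theta_i}$.

3) $V_{i+1} = V(t_V^*)$, $W_{i+1} = W(t_W^*)$, $S_{i+1} = S(t_S^*)$

4) $F^{(r)}_{i+1} = V_{i+1} S_{i+1} W_{i+1}$

5) $\theta_{i+1} = (V_{i+1}, S_{i+1}, W_{i+1})$

6) $i = i + 1$

After convergence: $F^{(r)} = F^{(r)}_i$.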
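The free gradients (18)-(20) can be sanity-checked against finite differences of the loss in (8). The sketch below uses random toy data and hypothetical helper names; it assumes $V$ has $l_y$ rows and $W$ has $l_x$ rows so that $V S W^\top X$ is well defined. The analytic expressions equal half the numerical derivative, because the factor 2 from differentiating the square is dropped. For $V$, the full derivative contains an extra term $V S W^\top X X^\top W S$ that does not appear in (18); the last line shows that this term is removed by the projection (16), so it has no effect on the constrained update.

```python
import numpy as np

rng = np.random.default_rng(0)
lx, ly, m, r = 6, 5, 20, 3                    # toy sizes (assumed)
X, Y = rng.normal(size=(lx, m)), rng.normal(size=(ly, m))
V = np.linalg.qr(rng.normal(size=(ly, r)))[0]
W = np.linalg.qr(rng.normal(size=(lx, r)))[0]
s = rng.uniform(0.5, 2.0, size=r)
S = np.diag(s)

def loss(V, S, W):
    return np.linalg.norm(Y - V @ S @ W.T @ X, "fro") ** 2

# Analytic free gradients, equations (18)-(20).
grad_V = -Y @ X.T @ W @ S
grad_W = -X @ Y.T @ V @ S + X @ X.T @ W @ S @ S
grad_S = np.eye(r) * (-W.T @ X @ Y.T @ V + W.T @ X @ X.T @ W @ S)

def numeric_grad(f, A, eps=1e-5):
    """Central finite differences of the scalar function f at A."""
    g = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        E = np.zeros_like(A)
        E[idx] = eps
        g[idx] = (f(A + E) - f(A - E)) / (2 * eps)
    return g

num_W = numeric_grad(lambda W_: loss(V, S, W_), W)
num_s = numeric_grad(lambda s_: loss(V, np.diag(s_), W), s)
num_V = numeric_grad(lambda V_: loss(V_, S, W), V)

omitted = V @ S @ W.T @ X @ X.T @ W @ S       # term absent from (18)
projected = omitted - V @ omitted.T @ V       # its image under (16)
print(np.abs(projected).max())
```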



4 Application Example: Image Restoration 

We demonstrate the application of MRS to an artificial image restoration task. The goal
is to restore the corrupted part of an image, given examples of corrupted images and the
corresponding clean images. The images are taken from the USPS postal database, which
consists of 16 x 16 grayscale patches representing handwritten digits. We independently
perturbed the gray values of each pixel in the lower half of each image with Gaussian
noise having standard deviation 0.1. Our data consisted of 2000 digits chosen at random,
with 1000 reserved for training.

To perform restoration, we first applied kernel PCA to extract 500 nonlinear features
from the noisy digits,$^7$ using a Gaussian kernel of width 10. Thus the restoration task
is a regression problem with a 500 dimensional input space $\mathcal{F}_x$, where we predict the
entire clean digits in a 256 dimensional output space $\mathcal{F}_y$.

7 For more detail on this feature extraction method, see [8].

Fig. 1. Example of image denoising using MRS. Each column contains a hand-written digit chosen
at random from the 1000 images in our test set. The first row displays the original images, the
second row contains the noisy images, and the third row shows the images as reconstructed using
MRS.

In our experiments, we compared ridge regression (RR), PLS, and MRS. We used
the ridge parameter 1e-6, which we optimised using 5-fold cross validation. For our
PLS solution, we used a rank 123 mapping, again finding this optimal rank with 5-fold
cross validation. We initialised MRS using a low rank approximation to the predictor
$F^{(r)}_{RR}$ found by ridge regression. To do this, we decomposed $F^{(r)}_{RR}$ as $U S V^\top$ via
the singular value decomposition, and set $U_0$ and $V_0$ to be the first 110 components
(determined by cross validation on the MRS solution) of U and V respectively, while
initialising $S_0 = I$. We give sample outcomes in Figure 1, and a summary of the results
in Table 1. We also give the result obtained by simply using the first 110 components of
the SVD of $F^{(r)}_{RR}$: the performance is worse than both ridge regression and MRS.

In this image restoration task, it appears that MRS performs substantially better 
than PLS while using a lower dimensional mapping, which validates our method for 
optimising over input and output subspaces. In addition, MRS has a small advantage over 
ridge regression in eliminating irrelevant features from the subspaces used in prediction 
(whereas RR shrinks the weights assigned to all the features). 



Table 1. Test error (using squared loss) of MRS, PLS, and RR for the digit restoration problem,
with results averaged over 1000 digits. The first column gives the performance of RR alone. The
second column uses a low rank (i.e. rank 110) approximation to the RR solution. The third and
fourth columns respectively show the PLS and MRS results with the rank in parentheses, where
MRS was initialised using the low rank RR solution.

        RR            RR(110)       PLS(123)      MRS(110)
RMSE    552.5 ± 0.1   554.9 ± 0.1   648.5 ± 0.1   550.53 ± 0.1
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5 Conclusions 

We have introduced a novel method, MRS, for performing regression on multivariate 
problems of high dimension, and demonstrated its performance in removing noise from 
images. We anticipate that this method will readily generalise to additional settings with 
high dimensional outputs: examples include regression to a reproducing kernel Hilbert 
space (of high or even infinite dimension), which can be used to recover images from 
incomplete or corrupted data; or as a means of classification, through mapping to a 
suitable output space [10]; and regression between discrete spaces, such as graphs [9,6], 
on which similarity measures may be defined via kernels. 
Abstract. In this article we investigate the field of Hilbertian metrics on
probability measures. Since they are very versatile and can therefore be
applied in various problems, they are of great interest in kernel methods.
Quite recently Topsøe and Fuglede introduced a family of Hilbertian metrics
on probability measures. We give basic properties of the Hilbertian
metrics of this family and of other metrics used in the literature. Then we
propose an extension of the considered metrics which incorporates structural
information of the probability space into the Hilbertian metric.
Finally we compare all proposed metrics in an image and text classification
problem using histogram data.



1 Introduction 

Recently the need for specific design of kernels for a given data structure has been
recognized by the kernel community. One type of structured data are probability
measures $\mathcal{M}^1_+(\mathcal{X})$$^1$ on a probability space $\mathcal{X}$. The following examples show the
wide range of applications of this class of kernels:

— Direct application on probability measures e.g. histogram data [1], 

— Having a statistical model for the data one can first fit the model to the data 
and then use the kernel to compare two fits, see [5,4]. 

— Given a bounded probability space $\mathcal{X}$ one can use the kernel to compare sets
in that space, by putting e.g. the uniform measure on each set.

In this article we study, instead of positive definite (PD) kernels, the more general
class of conditionally positive definite (CPD) kernels. To be more precise, we
concentrate on Hilbertian metrics, that is, metrics $d$ which can be isometrically
embedded into a Hilbert space, or equivalently for which $-d^2$ is CPD. This choice can be justified
by the fact that the support vector machine (SVM) only uses the metric information
of the CPD$^2$ kernel, see [3], and that every CPD kernel is generated by
a Hilbertian metric.

We propose a general method to build Hilbertian metrics on $\mathcal{M}^1_+(\mathcal{X})$ from
Hilbertian metrics on $\mathbb{R}_+$. Then we completely characterize the Hilbertian metrics
on $\mathcal{M}^1_+(\mathcal{X})$ which are invariant under the change of the dominating measure
using results of Fuglede. As a next step we introduce a new family of Hilbertian
metrics which incorporates similarity information of the probability space. Finally
we support the theoretical analysis by two experiments. First we compare
the performance of the basic metrics on probability measures in an image and
text classification problem. Second we do the image classification problem again
but now using similarity information of the color space.

1 $\mathcal{M}^1_+(\mathcal{X})$ denotes the set of positive measures $\mu$ on $\mathcal{X}$ with $\mu(\mathcal{X}) = 1$.

2 Note that every PD kernel is a CPD kernel.

C.E. Rasmussen et al. (Eds.): DAGM 2004, LNCS 3175, pp. 270-277, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

Hilbertian Metrics on Probability Measures and Their Application in SVM's



2 Hilbertian Metrics 

An interesting subclass of metrics is the class of Hilbertian metrics, that are 
metrics which can be isometrically embedded into a Hilbert space. In order to 
characterize this subclass of metrics, we first introduce the following function 
class: 

Definition 1. A real valued function $k$ on $\mathcal{X} \times \mathcal{X}$ is positive definite (PD)
(resp. conditionally positive definite (CPD)) if and only if $k$ is symmetric
and $\sum_{i,j=1}^{n} c_i c_j k(x_i, x_j) \ge 0$, for all $n \in \mathbb{N}$, $x_i \in \mathcal{X}$, $i = 1, \ldots, n$, and for all
$c_i \in \mathbb{R}$, $i = 1, \ldots, n$, (resp. for all $c_i \in \mathbb{R}$, $i = 1, \ldots, n$, with $\sum_{i=1}^{n} c_i = 0$).

The following theorem describes the class of Hilbertian metrics: 

Theorem 1 (Schoenberg [6]). A metric space $(\mathcal{X}, d)$ can be embedded isometrically
into a Hilbert space if and only if $-d^2(x, y)$ is CPD.
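On a finite set of points, the CPD condition of Definition 1 can be tested numerically: the quadratic form is non-negative on all zero-sum coefficient vectors exactly when the doubly centered matrix $HKH$, with $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, is positive semidefinite. The sketch below applies this check to $-d^2$ for two metrics chosen purely for illustration: the Euclidean distance, which is Hilbertian by construction, and the shortest-path metric on a star graph with three leaves, which is not. The function name `is_cpd` and both examples are assumptions of this sketch.

```python
import numpy as np

def is_cpd(K, tol=1e-9):
    """Check c^T K c >= 0 for all c with sum(c) = 0 via the centered matrix H K H."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return bool(np.linalg.eigvalsh(H @ K @ H).min() >= -tol)

# Euclidean distances between random points: -d^2 should be CPD.
rng = np.random.default_rng(0)
pts = rng.normal(size=(8, 3))
D_euclid = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

# Shortest-path metric on a star: three leaves (mutual distance 2), one center.
D_star = np.array([[0.0, 2.0, 2.0, 1.0],
                   [2.0, 0.0, 2.0, 1.0],
                   [2.0, 2.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0, 0.0]])

print(is_cpd(-D_euclid ** 2), is_cpd(-D_star ** 2))
```

For the star metric the coefficient vector $(1, 1, 1, -3)$ already gives a negative quadratic form, in line with the fact that the center would have to be the midpoint of every pair of leaves.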

What is the relevance of this notion for the SVM? Schölkopf showed that the
class of CPD kernels can be used in SVMs due to the translation invariance of
the maximal margin problem in the RKHS, see [7]. Furthermore it is well known
that the maximal margin problem is equivalent to the optimal separation of the
convex hulls of the two classes. This was used in [3] to show that the properties
of the SVM only depend on the Hilbertian metric. That is, all CPD kernels are
generated by a Hilbertian metric $d(x, y)$ through $k(x, y) = -d^2(x, y) + g(x) + g(y)$,
where $g : \mathcal{X} \to \mathbb{R}$, and the solution of the SVM only depends on the Hilbertian
metric $d(x, y)$.



3 Hilbertian Metrics on Probability Measures 

It would be very ambitious to address the question of all possible Hilbertian
metrics on probability measures. Instead we restrict ourselves to a special family.
Nevertheless this special case encompasses almost all measures previously used
in the machine learning community. In the first section we use recent results of
Fuglede and Topsøe, which describe all $\alpha$-homogeneous$^3$, continuous Hilbertian
(semi)-metrics on $\mathbb{R}_+$$^4$. Using these results it is straightforward to characterize
all Hilbertian metrics on $\mathcal{M}^1_+(\mathcal{X})$ of a certain form. In the second part we extend
the framework and incorporate similarity information of $\mathcal{X}$.

3 That means $d^2(cp, cq) = c^{2\alpha} d^2(p, q)$ for all $c \in \mathbb{R}_+$.

4 $\mathbb{R}_+$ is the positive part of the real line with 0 included.
3.1 Hilbertian Metrics on Probability Measures Derived from
Hilbertian Metrics on $\mathbb{R}_+$

For simplicity we will first only treat the case of discrete probability measures
on $D = \{1, 2, \ldots, N\}$, where $1 \le N < \infty$. Given a Hilbertian metric $d_{\mathbb{R}_+}$ on $\mathbb{R}_+$
it is easy to see that the metric $d_{\mathcal{M}^1_+}$ given by $d^2_{\mathcal{M}^1_+}(P, Q) = \sum_{i=1}^{N} d^2_{\mathbb{R}_+}(p_i, q_i)$
is a Hilbertian metric on $\mathcal{M}^1_+(D)$. The following proposition extends the simple
discrete case to the general case of a Hilbertian metric on a probability space
$\mathcal{X}$. In order to simplify the notation we define $p(x)$ to be the Radon-Nikodym
derivative $(dP/d\mu)(x)$$^5$ of $P$ with respect to the dominating measure $\mu$.

Proposition 1. Let $P$ and $Q$ be two probability measures on $\mathcal{X}$, $\mu$ an arbitrary
dominating measure$^6$ of $P$ and $Q$ and $d_{\mathbb{R}_+}$ a 1/2-homogeneous Hilbertian metric
on $\mathbb{R}_+$. Then $d_{\mathcal{M}^1_+(\mathcal{X})}$ defined as

$$d^2_{\mathcal{M}^1_+(\mathcal{X})}(P, Q) := \int_{\mathcal{X}} d^2_{\mathbb{R}_+}(p(x), q(x))\, d\mu(x), \qquad (1)$$

is a Hilbertian metric on $\mathcal{M}^1_+(\mathcal{X})$. $d_{\mathcal{M}^1_+(\mathcal{X})}$ is independent of the dominating
measure $\mu$.



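Proof. First we show by using the 1/2-homogeneity of $d_{\mathbb{R}_+}$ that $d_{\mathcal{M}^1_+(\mathcal{X})}$ is
independent of the dominating measure $\mu$. We have

$$\int_{\mathcal{X}} d^2_{\mathbb{R}_+}\Big(\frac{dP}{d\mu}, \frac{dQ}{d\mu}\Big)\, d\mu = \int_{\mathcal{X}} d^2_{\mathbb{R}_+}\Big(\frac{dP}{d\nu}\frac{d\nu}{d\mu}, \frac{dQ}{d\nu}\frac{d\nu}{d\mu}\Big) \frac{d\mu}{d\nu}\, d\nu = \int_{\mathcal{X}} d^2_{\mathbb{R}_+}\Big(\frac{dP}{d\nu}, \frac{dQ}{d\nu}\Big)\, d\nu,$$

where we use that $d^2_{\mathbb{R}_+}$ is 1-homogeneous. It is easy to show that $-d^2_{\mathcal{M}^1_+(\mathcal{X})}$
is conditionally positive definite: simply take for every $n \in \mathbb{N}$, $P_1, \ldots, P_n$ the
dominating measure $\frac{1}{n}\sum_{i=1}^{n} P_i$ and use that $-d^2_{\mathbb{R}_+}$ is conditionally positive definite.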

It is in principle very easy to construct Hilbertian metrics on $\mathcal{M}^1_+(\mathcal{X})$ using an
arbitrary Hilbertian metric on $\mathbb{R}_+$ and plugging it into the definition (1). But
the key property of the method we propose is the independence of the metric $d$
on $\mathcal{M}^1_+(\mathcal{X})$ of the dominating measure. That is, we have generated a metric which
is invariant with respect to general coordinate transformations on $\mathcal{X}$; therefore
we call it a covariant metric. For example the Euclidean norm on $\mathbb{R}_+$ will yield
a metric on $\mathcal{M}^1_+(\mathcal{X})$ but it is not invariant with respect to arbitrary coordinate
transformations. We think that this could be the reason why the naive application
of the linear or the Gaussian kernel yields worse results than Hilbertian
metrics resp. kernels which are invariant, see [1,5].
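The invariance claim can be illustrated in the discrete case, where a dominating measure is a vector of positive weights $\mu_i$ and the densities are $p_i/\mu_i$. The sketch below is an illustration under stated assumptions: it uses $(\sqrt{x} - \sqrt{y})^2/4$ as an example of a squared metric on $\mathbb{R}_+$ that is 1-homogeneous, and the squared Euclidean difference as a counterexample; the function names and the toy histograms are hypothetical.

```python
import numpy as np

def hellinger_sq_pointwise(x, y):
    # Example of a 1-homogeneous d^2 on R_+ (assumed for illustration).
    return 0.25 * (np.sqrt(x) - np.sqrt(y)) ** 2

def euclid_sq_pointwise(x, y):
    # Squared Euclidean difference: 2-homogeneous, hence not invariant.
    return (x - y) ** 2

def metric_sq(d2, P, Q, mu):
    """Discrete version of (1): sum_i d2(P_i / mu_i, Q_i / mu_i) * mu_i."""
    return float(np.sum(d2(P / mu, Q / mu) * mu))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.2, 0.6])
counting = np.ones(3)                 # counting measure
weighted = np.array([1.0, 2.0, 4.0])  # another dominating measure

print(metric_sq(hellinger_sq_pointwise, P, Q, counting),
      metric_sq(hellinger_sq_pointwise, P, Q, weighted))
print(metric_sq(euclid_sq_pointwise, P, Q, counting),
      metric_sq(euclid_sq_pointwise, P, Q, weighted))
```

The first pair of values coincides, while the Euclidean values change with the choice of dominating measure.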

Quite recently Fuglede completely characterized the class of homogeneous
Hilbertian metrics on $\mathbb{R}_+$. The set of all 1/2-homogeneous Hilbertian metrics
on $\mathbb{R}_+$ then characterizes all invariant Hilbertian metrics on $\mathcal{M}^1_+(\mathcal{X})$ of the
form (1).

5 In $\mathbb{R}^n$ the dominating measure $\mu$ is usually the Lebesgue measure. In this case we
can think of $p(x)$ as the normal density function.

6 Such a dominating measure always exists; take, e.g., $M = (P + Q)/2$.
Theorem 2 (Fuglede [2]). A symmetric function $d : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$ with
$d(x, y) = 0 \iff x = y$ is a $\gamma$-homogeneous, continuous Hilbertian metric $d$ on
$\mathbb{R}_+$ if and only if there exists a (necessarily unique) non-zero bounded measure
$\mu \ge 0$ on $\mathbb{R}_+$ such that $d^2$ can be written as

$$d^2(x, y) = \int_{\mathbb{R}_+} \left| x^{(\gamma + i\lambda)} - y^{(\gamma + i\lambda)} \right|^2 d\mu(\lambda).$$

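Topsøe proposed the following family of 1/2-homogeneous Hilbertian metrics.

Theorem 3 (Topsøe, Fuglede). The function $d : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}$ defined as:

$$d^2_{\alpha|\beta}(x, y) = \frac{\alpha\beta}{\alpha - \beta} \left[ \left( \frac{x^\alpha + y^\alpha}{2} \right)^{1/\alpha} - \left( \frac{x^\beta + y^\beta}{2} \right)^{1/\beta} \right] \qquad (2)$$

is a 1/2-homogeneous Hilbertian metric on $\mathbb{R}_+$, if $1 \le \alpha \le \infty$, $1/2 \le \beta \le \alpha$.
Moreover $-d^2$ is strictly CPD except when $\alpha = \beta$ or $(\alpha, \beta) = (1, 1/2)$.

Obviously one has $d^2_{\alpha|\beta} = d^2_{\beta|\alpha}$. Abusing notation we denote in the following
the final metric on $\mathcal{M}^1_+(\mathcal{X})$ generated using (1) by the same name $d^2_{\alpha|\beta}$. The
following special cases are interesting: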



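$$d^2_{\infty|1}(P, Q) = \frac{1}{2} \int_{\mathcal{X}} |p(x) - q(x)|\, d\mu(x), \qquad d^2_{\frac{1}{2}|1}(P, Q) = \frac{1}{4} \int_{\mathcal{X}} \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 d\mu(x),$$

$$d^2_{1|1}(P, Q) = \frac{1}{2} \int_{\mathcal{X}} \left[ p(x) \log \frac{2 p(x)}{p(x) + q(x)} + q(x) \log \frac{2 q(x)}{p(x) + q(x)} \right] d\mu(x).$$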



$d^2_{\infty|1}$ is the total variation$^7$. $d^2_{\frac{1}{2}|1}$ is the square of the Hellinger distance. It is
induced by the positive definite Bhattacharyya kernel, see [4]. $d^2_{1|1}$ can be derived
by a limit process, see [2]. It was not used in the machine learning literature
before. Since it is the proper version of a Hilbertian metric which corresponds to
the Kullback-Leibler divergence $D(P\|Q)$, it is especially interesting. In fact it
can be written with $M = (P + Q)/2$ as $d^2_{1|1}(P, Q) = \frac{1}{2}\left( D(P\|M) + D(Q\|M) \right)$.
For an interpretation from information theory, see [9]. We did not consider other
metrics from this family since they all have similar properties as we show later.
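Another 1/2-homogeneous Hilbertian metric previously used in the machine
learning literature is the modified $\chi^2$-distance: $d^2_{\chi^2}(P, Q) = \sum_{i=1}^{N} \frac{(p_i - q_i)^2}{p_i + q_i}$. $-d^2_{\chi^2}$ is
not PD, as often wrongly assumed in the literature, but CPD. See [8] for a proof
and also for the interesting upper and lower bounds on the considered metrics:

$$d^2_{\frac{1}{2}|1} \le d^2_{1|1} \le d^2_{\infty|1} \le 2\, d_{\frac{1}{2}|1}, \qquad 4\, d^2_{\frac{1}{2}|1} \le d^2_{\chi^2} \le 8\, d^2_{\frac{1}{2}|1}, \qquad 2\, d^4_{\infty|1} \le d^2_{\chi^2} \le 2\, d^2_{\infty|1}.$$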
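For histogram data the metrics of this section are straightforward to compute. The following sketch implements them for discrete distributions with respect to the counting measure and evaluates the chain of bounds on random histograms. The function names are hypothetical, the modified $\chi^2$-distance is taken to be $\sum_i (p_i - q_i)^2/(p_i + q_i)$, and empty bins are skipped using the convention $0 \log 0 = 0$.

```python
import numpy as np

def d2_tv(p, q):
    """d^2_{inf|1}: total variation, (1/2) sum |p - q|."""
    return 0.5 * np.sum(np.abs(p - q))

def d2_hellinger(p, q):
    """d^2_{1/2|1}: (1/4) sum (sqrt(p) - sqrt(q))^2."""
    return 0.25 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def d2_js(p, q):
    """d^2_{1|1}: (1/2) (D(P||M) + D(Q||M)) with M = (P + Q) / 2."""
    m = p + q
    def term(a):
        mask = a > 0                      # convention 0 log 0 = 0
        return np.sum(a[mask] * np.log(2 * a[mask] / m[mask]))
    return 0.5 * (term(p) + term(q))

def d2_chi2(p, q):
    """Modified chi^2 distance: sum (p - q)^2 / (p + q) over non-empty bins."""
    m = p + q
    mask = m > 0
    return np.sum((p[mask] - q[mask]) ** 2 / m[mask])

def d2_alpha_beta(p, q, alpha, beta):
    """Equation (2) summed over bins, for finite alpha != beta."""
    mean = lambda e: ((p ** e + q ** e) / 2) ** (1 / e)
    return alpha * beta / (alpha - beta) * np.sum(mean(alpha) - mean(beta))

rng = np.random.default_rng(0)
p, q = rng.dirichlet(np.ones(16)), rng.dirichlet(np.ones(16))
h, js, tv, chi = d2_hellinger(p, q), d2_js(p, q), d2_tv(p, q), d2_chi2(p, q)
print(h <= js <= tv <= 2 * np.sqrt(h), 4 * h <= chi <= 8 * h)
```

Setting `alpha=1` and `beta=0.5` in `d2_alpha_beta` reproduces `d2_hellinger`, which is a quick consistency check between (2) and the listed special case.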



In order to compare all different kinds of metrics resp. kernels on $\mathcal{M}^1_+(\mathcal{X})$
which were used in the kernel community, we also considered the geodesic
distance of the multinomial statistical manifold used in [5]: $d_{geo}(P, Q) =
\arccos\left(\sum_{i=1}^{N} \sqrt{p_i q_i}\right)$. We could not prove that it is Hilbertian. In [5] they
actually use the kernel $\exp(-\lambda d^2_{geo}(P, Q))$ as an approximation to the first order
parametrix of the heat kernel of the multinomial statistical manifold. Despite
the mathematical beauty of this approach, there remains the problem that one
can only show that this kernel is PD for $\lambda < \epsilon$$^8$. In practice $\epsilon$ is not known, which
makes it hard to judge when this approach may be applied.

7 This metric was implicitly used before, since it is induced by the positive definite
kernel $k(P, Q) = \sum_{i=1}^{N} \min(p_i, q_i)$.

It is worth mentioning that all the Hilbertian metrics explicitly mentioned in
this section can be written as f-divergences. It is a classical result in information
geometry that all f-divergences induce, up to scaling, the Fisher metric. In
this sense all considered metrics are locally equivalent. Globally we have the upper
and lower bounds introduced earlier. Therefore we expect relatively small
deviations between the results of the different metrics in our experiments.
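The relations between these metrics can be checked numerically on a pair of histograms. The following is a minimal sketch, not the authors' code; it assumes the normalisations 1/2 (total variation), 1/4 (Hellinger) and 1/2 (Jensen-Shannon form of $d^2_{1|1}$), natural logarithms, and hypothetical function names chosen only for this illustration.

```python
import math

def _term(a, b):
    # a * log(2a / (a + b)), with the convention 0 * log 0 = 0
    return a * math.log(2.0 * a / (a + b)) if a > 0 else 0.0

def d2_tv(p, q):
    # assumed normalisation: half the L1 distance
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def d2_hellinger(p, q):
    # assumed normalisation: a quarter of the squared L2 distance of the roots
    return 0.25 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def d2_js(p, q):
    return 0.5 * sum(_term(a, b) + _term(b, a) for a, b in zip(p, q))

def d2_chi2(p, q):
    # bins where both histograms are empty contribute nothing
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def kl(p, q):
    # Kullback-Leibler divergence D(P||Q); assumes q > 0 wherever p > 0
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p = [0.5, 0.3, 0.2, 0.0]
q = [0.1, 0.4, 0.4, 0.1]
m = [(a + b) / 2 for a, b in zip(p, q)]

print("TV:", d2_tv(p, q), "Hellinger:", d2_hellinger(p, q))
print("JS:", d2_js(p, q), "via KL:", 0.5 * (kl(p, m) + kl(q, m)))
print("chi2:", d2_chi2(p, q))
```

Under these assumed normalisations the two printed Jensen-Shannon values agree, which is the identity with $M = (P+Q)/2$ stated in the text.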

3.2 Hilbertian Metrics on Probability Measures Incorporating 
Structural Properties of the Probability Space 

If the probability space $X$ is a metric space $(X, d_X)$, one can use $d_X$ to derive a
metric on $\mathcal{M}_+^1(X)$. One example of this kind is the Kantorovich metric:

$$d_K(P,Q) = \inf \Bigl\{ \int_{X \times X} d_X(x,y)\, d\mu(x,y) \;\Big|\; \mu \in \mathcal{M}_+^1(X \times X),\ \pi_1(\mu) = P,\ \pi_2(\mu) = Q \Bigr\}$$

where $\pi_i$ denotes the marginal with respect to the $i$-th coordinate. When $X$ is finite,
the Kantorovich metric gives the solution to the mass transportation problem. In
a similar spirit we extend the generation of Hilbertian metrics on $\mathcal{M}_+^1(X)$ based
on (1) by using similarity information of the probability space $X$. That means
we do not only compare the densities pointwise but also the densities of distinct
points weighted by a similarity measure $k(x,y)$ on $X$. The only requirement we
need is that we are given a similarity measure on $X$, namely a positive definite
kernel $k(x,y)$ (see footnote 9). The disadvantage of our approach is that we are no longer
invariant with respect to the dominating measure. On the other hand, if one
can define a kernel on $X$, then one can build, e.g. by the induced semi-metric, a
uniform measure $\mu$ on $X$ and use this as a dominating measure. We denote in
the following by $\mathcal{M}_+^1(X,\mu)$ all probability measures which are dominated by $\mu$.

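Theorem 4. Let $k$ be a PD kernel on $X$ and $\hat{k}$ a PD kernel on $\mathbb{R}_+$ such that
$\int_X \sqrt{k(x,x)\, \hat{k}(q(x), q(x))}\, d\mu(x) < \infty$, $\forall\, q \in \mathcal{M}_+^1(X,\mu)$. Then

$$K(P,Q) = \int_X \int_X k(x,y)\, \hat{k}(p(x), q(y))\, d\mu(x)\, d\mu(y) \qquad (3)$$

is a positive definite kernel on $\mathcal{M}_+^1(X,\mu) \times \mathcal{M}_+^1(X,\mu)$.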

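Proof. Note first that the product $k(x,y)\, \hat{k}(r,s)$ ($x, y \in X$, $r, s \in \mathbb{R}_+$) is a positive
definite kernel on $X \times \mathbb{R}_+$. The corresponding RKHS $\mathcal{H}$ is the tensor product
of the RKHS $\mathcal{H}_k$ and $\mathcal{H}_{\hat{k}}$, that is $\mathcal{H} = \mathcal{H}_k \otimes \mathcal{H}_{\hat{k}}$. We denote the corresponding
feature map by $(x, r) \mapsto \phi_x \otimes \psi_r$. Now let us define a linear map $L_q : \mathcal{H} \to \mathbb{R}$ by

$$L_q : \phi_x \otimes \psi_r \mapsto \int_X k(x,y)\, \hat{k}(r, q(y))\, d\mu(y) = \int_X \langle \phi_x, \phi_y \rangle_{\mathcal{H}_k} \langle \psi_r, \psi_{q(y)} \rangle_{\mathcal{H}_{\hat{k}}}\, d\mu(y) \le \|\phi_x \otimes \psi_r\|_{\mathcal{H}} \int_X \|\phi_y \otimes \psi_{q(y)}\|_{\mathcal{H}}\, d\mu(y).$$

Therefore by the assumption $L_q$ is continuous. By the Riesz lemma, there exists
a vector $u_q$ such that $\forall\, v \in \mathcal{H}$, $\langle u_q, v \rangle_{\mathcal{H}} = L_q(v)$. It is obvious from

$$\langle u_p, u_q \rangle_{\mathcal{H}} = \int_X \langle u_p, \phi_y \otimes \psi_{q(y)} \rangle_{\mathcal{H}}\, d\mu(y) = \int_{X^2} \langle \phi_x \otimes \psi_{p(x)}, \phi_y \otimes \psi_{q(y)} \rangle_{\mathcal{H}}\, d\mu(y)\, d\mu(x) = \int_{X^2} k(x,y)\, \hat{k}(p(x), q(y))\, d\mu(x)\, d\mu(y)$$

that $K$ is positive definite.

8 which does not imply that $-d^2_{geo}$ is CPD.

9 Note that a positive definite kernel $k$ on $X$ always induces a semi-metric on $X$ by
$d_X^2(x,y) = k(x,x) + k(y,y) - 2k(x,y)$.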



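The induced Hilbertian metric $D$ of $K$ is given by

$$D^2(P,Q) = \int_{X^2} k(x,y) \bigl[ \hat{k}(p(x), p(y)) + \hat{k}(q(x), q(y)) - 2\hat{k}(p(x), q(y)) \bigr]\, d\mu(x)\, d\mu(y) = \int_X \int_X k(x,y)\, \langle \psi_{p(x)} - \psi_{q(x)},\ \psi_{p(y)} - \psi_{q(y)} \rangle\, d\mu(x)\, d\mu(y). \qquad (4)$$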
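To make the effect of the similarity kernel concrete, the following sketch evaluates the structural metric on a finite space with counting measure. It is an illustration under stated assumptions, not the authors' implementation: the kernel on $\mathbb{R}_+$ is taken to be $\hat{k}(r,s) = \sqrt{rs}$ (so the feature map is $\psi_r = \sqrt{r}$), the similarity kernel on $X$ is a Gaussian of the bin positions, and the function names are hypothetical.

```python
import math

def structural_d2(p, q, k):
    # D^2(P,Q) = sum_{x,y} k(x,y) (sqrt(p_x) - sqrt(q_x)) (sqrt(p_y) - sqrt(q_y)),
    # i.e. the inner-product form of the metric with psi_r = sqrt(r) (assumed choice)
    diff = [math.sqrt(a) - math.sqrt(b) for a, b in zip(p, q)]
    n = len(diff)
    return sum(k[i][j] * diff[i] * diff[j] for i in range(n) for j in range(n))

def gaussian_similarity(positions, width):
    # PD similarity kernel on the bin positions (assumed for this example)
    return [[math.exp(-((x - y) / width) ** 2) for y in positions] for x in positions]

identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
similar = gaussian_similarity([0.0, 1.0, 2.0], width=1.0)

p = [1.0, 0.0, 0.0]   # all mass in bin 0
q = [0.0, 1.0, 0.0]   # all mass in the neighbouring bin 1
r = [0.0, 0.0, 1.0]   # all mass in the distant bin 2

print(structural_d2(p, q, identity), structural_d2(p, r, identity))
print(structural_d2(p, q, similar), structural_d2(p, r, similar))
```

With the identity kernel the metric compares densities only pointwise, so both pairs are equally far apart. With the Gaussian similarity, mass in a neighbouring bin counts as closer than mass in a distant bin.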



4 Experiments 



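The performance of the following Hilbertian metrics on probability distributions

$$d^2_{geo}(P,Q) = \arccos^2\Bigl(\sum_{i=1}^{N} \sqrt{p_i q_i}\Bigr), \qquad d^2_{\chi^2}(P,Q) = \sum_{i=1}^{N} \frac{(p_i - q_i)^2}{p_i + q_i},$$

$$d^2_{H}(P,Q) = \frac{1}{4}\sum_{i=1}^{N} \bigl(\sqrt{p_i} - \sqrt{q_i}\bigr)^2, \qquad d^2_{TV}(P,Q) = \frac{1}{2}\sum_{i=1}^{N} |p_i - q_i|,$$

$$d^2_{JS}(P,Q) = \frac{1}{2}\sum_{i=1}^{N} \Bigl[\, p_i \log\frac{2p_i}{p_i + q_i} + q_i \log\frac{2q_i}{p_i + q_i} \Bigr] \qquad (5)$$

respectively of the transformed "Gaussian" metrics

$$d^2_{exp}(P,Q) = 1 - \exp\bigl(-\lambda\, d^2(P,Q)\bigr) \qquad (6)$$

was evaluated in three multi-class classification tasks: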

The Reuters data set. The documents are represented as term histograms. Following
[5] we used the five most frequent classes earn, acq, moneyFx, grain and
crude. We excluded documents that belong to more than one of these classes.
This resulted in a data set with 8085 examples of dimension 18635.

The WebKB web pages data set. The documents are also represented as histograms. We
used the four most frequent classes student, faculty, course and project. 4198
documents remained, each of dimension 24212 (see [5]).

The Corel image data base. We chose the data set Corel14 as in [1], which has 14 classes. Two different
features were used. First the histogram was computed directly from the RGB
data, second from the CIE Lab color space, which has the advantage that the Euclidean
metric in that space locally discriminates colors according to human
vision uniformly over the whole space. Therefore the quantization process is more
meaningful in CIE Lab than in RGB space (see footnote 10). In both spaces we used 16 bins per
dimension, yielding a 4096-dimensional histogram.

All the data sets were split
into a training (80%) and a test (20%) set. The multi-class problem was solved by
one-vs-all with SVMs using the CPD kernels $K = -d^2$. For each metric $d$ from
(5) we either used the metric directly with varying penalty constants $C$ in the
SVM, or we used the transformed metric $d_{exp}$ defined in (6), again with different
penalty constants $C$ and $\lambda$. The best parameters were found using 10-fold cross-validation
from the set $C \in \{10^k \mid k = -2, -1, \ldots, 4\} =: R_C$ respectively $(C, \lambda) \in$
$R_C \times$ y{2, 1, §, §, §, i, ^}, where $\sigma$ was set to {f , A/, A/,
to compensate for the different maximal distances of $d_{geo}, d_{\chi^2}, d_H, d_{JS}, d_{TV}$
respectively. For the best parameters the classifier was then trained on the whole
training set and its error evaluated on the test set. The results are shown in Table 1.

In a second experiment we used (4) for the Corel data (see footnote 11). We employ the
Euclidean CIE 94 distance on the color space, since it models the color perception
of humans, together with the compactly supported RBF $k(x,y) = (1 - \|x - y\|)_+$,
see e.g. [10], to generate a similarity kernel for the color space. Then the same
experiments are done again for the RGB histograms and the CIE histograms
with all the distances except the geodesic one, since it is not of the form (1).
The results are shown in rows CIE CIE94 and RGB CIE94.

Table 1. The table shows the test errors with the optimal values of the parameters
$C$ resp. $C, \lambda$ found from 10-fold cross-validation. The first row of each
data set is obtained using the metric directly, the second row shows the errors
of the transformed metric (6).

10 In principle we expect no difference in the results of RGB and CIE Lab when we use
invariant metrics. The differences in practice come from the different discretizations.

11 The geodesic distance cannot be used since it cannot be written in appropriate form.

The results show that there is not a "best" metric. It is quite interesting that the results of the direct
application of the metric are comparable to those of the transformed "Gaussian"
metric. Since the "Gaussian" metric requires an additional search for the optimal
width parameter, in the case of limited computational resources the direct
application of the metric seems to yield a good trade-off.
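The two kinds of kernel matrices used in the experiments can be sketched as follows. This is only an illustration with random histograms, not the experimental code: it uses the total variation metric with an assumed factor 1/2, an arbitrary width $\lambda = 1$, and hypothetical helper names. It builds the CPD matrix $-d^2$ and the "Gaussian" matrix $\exp(-\lambda d^2)$ and evaluates the quadratic forms that (conditional) positive definiteness constrains.

```python
import math
import random

def d2_tv(p, q):
    # total variation with an assumed factor 1/2
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def gram(hists, d2, lam=None):
    # lam is None: CPD kernel K = -d^2; otherwise the PD kernel exp(-lam * d^2)
    if lam is None:
        return [[-d2(p, q) for q in hists] for p in hists]
    return [[math.exp(-lam * d2(p, q)) for q in hists] for p in hists]

def quadratic_form(K, c):
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

def random_histogram(rng, bins):
    w = [rng.random() for _ in range(bins)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
hists = [random_histogram(rng, 8) for _ in range(6)]
C_grid = [10.0 ** k for k in range(-2, 5)]   # the set R_C from the text

c = [rng.uniform(-1, 1) for _ in hists]
mean = sum(c) / len(c)
c0 = [x - mean for x in c]                    # coefficients summing to zero

print("PD form :", quadratic_form(gram(hists, d2_tv, lam=1.0), c))
print("CPD form:", quadratic_form(gram(hists, d2_tv), c0))
```

The CPD kernel only guarantees a non-negative quadratic form for coefficient vectors that sum to zero, which is why `c0` is centred before it is used.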

5 Conclusion 

We presented a general method to build Hilbertian metrics on probability measures
from Hilbertian metrics on $\mathbb{R}_+$. Using results of Fuglede we characterized
the class of Hilbertian metrics on probability measures generated from Hilbertian
metrics on $\mathbb{R}_+$ which are invariant under the change of the dominating
measure. We then generalized this framework by incorporating a similarity measure
on the probability space into the Hilbertian metric, thus adding structural
information of the probability space to the distance. Finally we compared all
studied Hilbertian metrics in two text and one image classification tasks.

Abstract. In this paper image-based techniques for 3D surface reconstruction
are presented which are especially suitable for (but not limited
to) coplanar light sources. The first approach is based on a single-image 
shape from shading scheme, combined with the evaluation of at least two 
further images of the scene that display shadow areas. The second ap- 
proach allows the reconstruction of surfaces by an evaluation of quotients 
of images of the scene acquired under different illumination conditions 
and is capable of separating brightness changes due to surface shape from 
those caused by variable albedo. A combination of both techniques is sug- 
gested. The proposed approaches are applied to the astrogeological task 
of three-dimensional reconstruction of regions on the lunar surface using 
ground-based CCD images. Beyond the planetary science scenario, they 
are applicable to classical machine vision tasks such as surface inspection 
in the context of industrial quality control. 



1 Introduction 

A well-known method for image-based 3D reconstruction of surfaces is shape from 
shading (SFS). This technique aims at deriving the orientation of the surface at 
each pixel by using a model of the reflectance properties of the surface and 
knowledge about the illumination conditions. 

1.1 Related Work 

Traditional applications of such techniques in planetary science, mostly referred 
to as photoclinometry, rely on single images of the scene and use line-based, in- 
tegrative methods designed to reveal a set of profiles along one-dimensional lines 
rather than a 3D reconstruction of the complete surface [1,5]. In contrast to 
these approaches, shape from shading and photometric stereo techniques based 
on the minimization of a global error term for multiple images of the scene have 
been developed in the field of computer vision - for detailed surveys on the 
SFS and photometric stereo methodology see [1,3]. For a non-uniform surface 
albedo, however, these approaches require that the directions of illumination 
are not coplanar [3]. Recent work in this field [6] deals with the reconstruction 
of planetary surfaces based on shape from shading by means of multiple images 
acquired from precisely known locations at different illumination conditions, provided
that the reflectance properties of the surface are thoroughly modelled. In
[7] shadows are used in the context of photometric stereo with multiple non-coplanar
light sources to recover locally unique surface normals from two image
intensities and a zero intensity caused by shadow.

1.2 Shape from Shading: An Overview 

For a single image of the scene, parallel incident light and an infinite distance
between camera and object, the intensity $I(u,v)$ of image pixel $(u,v)$ amounts to

$$I(u,v) = \kappa\, I_i\, \Phi(n(x,y,z), s, v). \qquad (1)$$

Here, $\kappa$ is a camera constant, $v$ the direction to the camera, $I_i$ the intensity of
incident light and $s$ its direction, and $\Phi$ the so-called reflectance function. A well-known
example is the Lambertian reflectance function $\Phi(n,s) = \alpha \cos\theta$ with
$\theta = \angle(n,s)$ and $\alpha$ as a surface-specific constant. The product $\kappa I_i \alpha = \rho(u,v)$ is
called surface albedo. In the following, the surface normal $n$ will be represented
by the directional derivatives $p = z_x$ and $q = z_y$ of the surface function $z(u,v)$
with $n = (-p, -q, 1)$. The term $R(p,q) = \kappa I_i \Phi$ is called reflectance map. Solving
the SFS problem requires determining the surface $z(u,v)$ with gradients $p(u,v)$
and $q(u,v)$ that minimizes the average deviation between the measured pixel intensity
$I(u,v)$ and the modelled reflectance $R(p(u,v), q(u,v))$. This corresponds
to minimizing the intensity error term

$$e_i = \sum_{u,v} \bigl[ I(u,v) - R(p(u,v), q(u,v)) \bigr]^2. \qquad (2)$$
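The Lambertian reflectance map and the intensity error term can be written down in a few lines. The sketch below is an illustration, not the author's implementation: the function names are hypothetical, images are plain nested lists indexed as `image[v][u]`, and negative cosines are clipped to zero, which is an added assumption for surface parts facing away from the light.

```python
import math

def lambert_reflectance(p, q, s, albedo=1.0):
    # R(p, q) = albedo * cos(angle(n, s)) with n = (-p, -q, 1)
    n = (-p, -q, 1.0)
    dot = sum(a * b for a, b in zip(n, s))
    norm = math.sqrt(sum(a * a for a in n)) * math.sqrt(sum(b * b for b in s))
    return albedo * max(0.0, dot / norm)   # clipping is an assumption

def intensity_error(image, p, q, s, albedo=1.0):
    # sum over all pixels of [I(u,v) - R(p(u,v), q(u,v))]^2
    return sum((image[v][u] - lambert_reflectance(p[v][u], q[v][u], s, albedo)) ** 2
               for v in range(len(image)) for u in range(len(image[0])))

# light from the side at 30 degrees elevation: s = (-cot(mu), 0, 1)
s = (-1.0 / math.tan(math.radians(30.0)), 0.0, 1.0)

p = [[0.0, 0.1], [0.2, 0.1]]
q = [[0.0, 0.3], [-0.2, 0.1]]
image = [[lambert_reflectance(p[v][u], q[v][u], s) for u in range(2)] for v in range(2)]

print(intensity_error(image, p, q, s))                      # gradients that rendered the image
q_flipped = [[-x for x in row] for row in q]
print(intensity_error(image, p, q_flipped, s))              # sign of q reversed
```

Both calls give the same error, because with light incident along the $u$ axis the gradient $q$ enters the reflectance only through the length of the normal. This is one concrete instance of the single-image ambiguity.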

Surface reconstruction based on a single monocular image with no constraints is
an ill-posed problem as for a given image $I(u,v)$ there exists an infinite number of
minima of $e_i$ for the unknown values of $p(u,v)$, $q(u,v)$, and $\rho(u,v)$. A well-known
method to alleviate this ambiguity consists of imposing regularization constraints
on the shape of the surface. A commonly used constraint is smoothness of the
surface, implying small absolute values of the directional derivatives $p_x$, $p_y$, $q_x$, and $q_y$
of the surface gradients $p$ and $q$ (cf. [1,3]). This leads to an additional error term

$$e_s = \sum_{u,v} \bigl[ p_x^2 + p_y^2 + q_x^2 + q_y^2 \bigr]. \qquad (3)$$

Solving the problem of surface reconstruction then consists of globally minimizing
the overall error function $e = e_s + \lambda e_i$, where the Lagrangian multiplier $\lambda$
denotes the relative weight of the error terms. As explained in detail in [1-3],
setting the derivatives of $e$ with respect to the surface gradients $p$ and $q$ to
zero leads to an iterative update rule to be repeatedly applied pixelwise until
convergence of $p(u,v)$ and $q(u,v)$ is achieved. Once the surface gradients are determined,
the surface profile $z(u,v)$ is obtained by numerical integration of the
surface gradients as described in [3]. According to [2], constraint (3) can be replaced
by or combined with the physically intuitive assumption of an integrable
surface gradient vector field within the same variational framework.

The ambiguity of the solution of the shape from shading problem can only
be completely removed, however, by means of photometric stereo techniques,
i.e. by making use of several light sources. A traditional approach is to acquire
$L = 3$ images of the scene (or an even larger number) at different illumination
conditions represented by $s_l$, $l = 1, \ldots, L$. As long as these vectors are not
coplanar and a Lambertian reflectance map can be assumed, a unique solution
for both the surface gradients $p(u,v)$ and $q(u,v)$ and the non-uniform surface
albedo $\rho(u,v)$ can be obtained analytically in a straightforward manner [3].

In many application scenarios, however, it is difficult or impossible to obtain
$L \ge 3$ images acquired with non-coplanar illumination vectors $s_l$, $l = 1, \ldots, L$.
For example, the equatorial regions of the Moon (only these appear nearly undistorted
for ground-based telescopes) are always illuminated either exactly from
the east or exactly from the west, such that all possible illumination vectors $s$
are coplanar. The illumination vectors are thus given by $s_l = (-\cot\mu_l, 0, 1)$ with
$\mu_l$ denoting the solar elevation angle for image $l$. Identical conditions occur e.g.
for the planet Mercury and the major satellites of Jupiter. In scenarios beyond
planetary science applications, such as visual quality inspection systems, there
is often not enough space available to sufficiently distribute the light sources.
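A discrete version of the smoothness term and of the combined error can be sketched as follows. The forward-difference approximation of the derivatives and the function names are assumptions made for this illustration; the cited references may discretize differently.

```python
def smoothness_error(p, q):
    # sum of squared forward differences of p and q in both image directions
    def squared_differences(g):
        rows, cols = len(g), len(g[0])
        dx = sum((g[v][u + 1] - g[v][u]) ** 2
                 for v in range(rows) for u in range(cols - 1))
        dy = sum((g[v + 1][u] - g[v][u]) ** 2
                 for v in range(rows - 1) for u in range(cols))
        return dx + dy
    return squared_differences(p) + squared_differences(q)

def overall_error(e_s, e_i, lam):
    # e = e_s + lambda * e_i, with lam weighting the intensity term
    return e_s + lam * e_i

flat = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
ramp = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]   # p increases by 1 per pixel along u

print(smoothness_error(flat, flat))          # constant gradients: no penalty
print(smoothness_error(ramp, flat))          # four unit steps along u
print(overall_error(smoothness_error(ramp, flat), e_i=0.5, lam=2.0))
```

A larger `lam` favours fidelity to the image over smoothness of the reconstructed gradient field.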
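The role of non-coplanar light sources can be illustrated with the classical three-image Lambertian case. This is a minimal sketch, not the method proposed in this paper: it assumes unit-normalised illumination vectors, no shadowed measurements, and hypothetical function names. The scaled normal $\rho\,\hat{n}$ is obtained from a $3 \times 3$ linear system, and the solver refuses coplanar illumination because the system is then singular.

```python
import math

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def photometric_stereo(intensities, lights):
    # Solve S g = I for g = albedo * unit normal by Cramer's rule.
    S = [unit(s) for s in lights]
    d = det3(S)
    if abs(d) < 1e-12:
        raise ValueError("illumination vectors are coplanar")
    g = []
    for col in range(3):
        m = [row[:] for row in S]
        for r in range(3):
            m[r][col] = intensities[r]
        g.append(det3(m) / d)
    albedo = math.sqrt(sum(c * c for c in g))
    return -g[0] / g[2], -g[1] / g[2], albedo   # p, q, albedo

# synthetic pixel with p = 0.2, q = -0.1 and albedo 0.8
normal = unit([-0.2, 0.1, 1.0])
lights = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [-1.0, -1.0, 1.0]]
observed = [0.8 * sum(a * b for a, b in zip(normal, unit(s))) for s in lights]
print(photometric_stereo(observed, lights))

# east-west illumination s_l = (-cot(mu_l), 0, 1) is always coplanar
coplanar = [[-1.0 / math.tan(math.radians(mu)), 0.0, 1.0] for mu in (5.0, 15.0, 30.0)]
try:
    photometric_stereo(observed, coplanar)
except ValueError as err:
    print("rejected:", err)
```

For the coplanar configuration the second column of the illumination matrix is zero, so no information about $q$ can be recovered from these intensities alone.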

Hence, this paper proposes shape from shading techniques based on multiple 
images of the scene, including the evaluation of shadows, which are especially 
suitable for (but not limited to) the practically very relevant case of coplanar 
light sources, and which do not require a Lambertian reflectance map. 



2 Shadow-Based Initialization of the SFS Algorithm 

A uniform albedo $\rho(u,v) = \rho_0$ and oblique illumination (which is necessary to
reveal subtle surface details) will be assumed throughout this section. Despite
this simplification the outcome of the previously described SFS scheme is highly
ambiguous if only one image is used for reconstruction, and no additional information
is introduced by further shading images due to the coplanarity of the
illumination vectors. Without loss of generality it is assumed that the scene is
illuminated exactly from the left or the right hand side. Consequently, the surface
gradients $q(u,v)$ perpendicular to the direction of incident light cannot be
determined accurately for small illumination angles by SFS alone unless further
constraints, e.g. boundary values of $z(u,v)$ [1], are imposed.

Hence, a novel concept is introduced, consisting of a shadow analysis step performed
by means of at least two further images (in the following called “shadow
images”) of the scene acquired under different illumination conditions (Fig. 1a).
All images have to be pixel-synchronous, so that image registration techniques
(for a survey cf. [4]) have to be applied. After image registration is performed,
shadow regions can be extracted either by a binarization of the shadow image
or by a binarization of the quotient of the shading and the shadow image. The
latter technique prevents surface parts with a low albedo from being erroneously
classified as shadows; it will therefore be used throughout this paper. A suitable
binary threshold is derived by means of histogram analysis in a straightforward
manner. The extracted shadow area is regarded as being composed of $S$ shadow
lines. As shown in Fig. 1b, shadow line $s$ has a length of $l^{(s)}$ pixels. This corresponds
to an altitude difference $(\Delta z)^{(s)}_{shadow} = l^{(s)} \tan\mu_{shadow}$, where the angle
$\mu_{shadow}$ denotes the elevation angle of the light source that produces the shadow.

Fig. 1. Shadow-based initialization of the SFS algorithm. (a) Shadow and shading
images. The region inside the rectangular box is reconstructed. (b) Surface part between
the shadows. (c) Initial 3D profile $z_0(u,v)$ of the surface patch between the shadows.

Altitude difference $(\Delta z)^{(s)}_{shadow}$ can be determined at high accuracy as it is
independent of a reflectance model. It is used to introduce an additional shadow-based
error term $e_z$ into the variational SFS scheme according to

$$e_z = \sum_{s=1}^{S} \left[ \frac{(\Delta z)^{(s)}_{sfs} - (\Delta z)^{(s)}_{shadow}}{l^{(s)}} \right]^2. \qquad (4)$$

This leads to a minimization of the overall error term $e = e_s + \lambda e_i + \eta e_z$ with $\eta$
as an additional Lagrangian multiplier, aiming at an adjustment of the altitude
difference $(\Delta z)^{(s)}_{sfs}$ measured on the surface profile along the shadow line to the
altitude difference obtained by shadow analysis. For details cf. [8].

This concept of shading and shadow based 3D surface reconstruction is extended
by initializing the SFS algorithm based on two or more shadow images,
employing the following iterative scheme:

1. Initially, it is assumed that the altitudes of the ridges casting the shadows
(solid line in Fig. 1b) are constant, respectively, and identical. The iteration
index $m$ is set to $m = 0$. The 3D profile $z_m(u,v)$ of the small surface patch
between the two shadow lines (hatched area in Fig. 1b) is derived from the
measured shadow lengths (Fig. 1c).

2. The surface profile $z_m(u,v)$ directly yields the surface gradients $p_0(u,v)$ and
$q_0(u,v)$ for all pixels belonging to the surface patch between the shadow
lines. They are used to compute the albedo $\rho_0$, serve as initial values for
the SFS algorithm, and will be kept constant throughout the following steps
of the algorithm. Outside the region between the shadow lines, $p_0(u,v)$ and
$q_0(u,v)$ are set to zero.

3. Using the single-image SFS algorithm with the initialization applied in step 2,
the complete surface profile $z_m(u,v)$ is reconstructed based on the shading
image. The resulting altitudes of the ridges casting the shadows are extracted
from the reconstructed surface profile $z_m(u,v)$. This yields a new
profile $z_{m+1}(u,v)$ for the surface patch between the shadow lines.

4. The iteration index $m$ is incremented ($m := m + 1$). The algorithm cycles
through steps 2, 3, and 4 until it terminates once convergence is achieved, i.e.
$\bigl\langle (z_m(u,v) - z_{m-1}(u,v))^2 \bigr\rangle^{1/2}_{u,v} < \Theta_z$. A threshold value of $\Theta_z = 0.01$ pixels is
applied for termination of the iteration process.

This approach is applicable to arbitrary reflectance functions $R(p,q)$. It mutually
adjusts in a self-consistent manner the altitude profiles of the floor and of the
ridges that cast the shadows. It allows determining not only the surface gradients
$p(u,v)$ in the direction of incident light, as can be achieved by SFS without
additional constraints, but also the surface gradients $q(u,v)$ in the perpendicular
direction. Furthermore, it can be extended in a straightforward
manner to more than two shadows and to shape from shading algorithms based
on multiple light sources or regularization constraints beyond those described by
eq. (3) and (4).
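The shadow constraint itself needs no reflectance model, which a short sketch makes explicit. The code below is an illustration under stated assumptions, not the author's implementation: altitudes are expressed in pixel units, each shadow-line mismatch is assumed to be squared and normalised by the shadow length, and the function names are hypothetical.

```python
import math

def shadow_altitude_difference(length_px, elevation_deg):
    # (Delta z)_shadow = l * tan(mu_shadow), with l in pixels
    return length_px * math.tan(math.radians(elevation_deg))

def shadow_error(dz_sfs, lengths_px, elevation_deg):
    # Assumed form of the shadow term: squared mismatch between the altitude
    # difference on the reconstructed profile and the one from the shadow
    # length, normalised by the shadow length and summed over shadow lines.
    return sum(((dz - shadow_altitude_difference(l, elevation_deg)) / l) ** 2
               for dz, l in zip(dz_sfs, lengths_px))

lengths = [20.0, 12.0]
elevation = 5.0
dz_true = [shadow_altitude_difference(l, elevation) for l in lengths]

print(dz_true)
print(shadow_error(dz_true, lengths, elevation))                       # consistent profile
print(shadow_error([dz_true[0] + 2.0, dz_true[1]], lengths, elevation))  # first ridge 2 px too high
```

At low elevation angles a long shadow corresponds to a small altitude difference, so the shadow length is a sensitive measurement of relief.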



3 Quotient-Based Photometric Stereo 



The second approach to SFS under coplanar light sources copes with a non-uniform
albedo $\rho(u,v)$ and the very general class of reflectance maps given by
$\tilde{R}(\rho, p, q) = \rho(u,v)\, R(p,q)$. At least two pixel-synchronous images of the scene
acquired under different illumination conditions and containing no shadow areas
are required. For each pixel position $(u,v)$, the quotient $I_1(u,v)/I_2(u,v)$ of
pixel intensities is desired to be identical to the quotient $R_1(u,v)/R_2(u,v)$ of
reflectances. This suggests a quotient-based intensity error term

$$\tilde{e}_i = \sum_{u,v} \left( \frac{I_1(u,v)\, R_2(u,v)}{I_2(u,v)\, R_1(u,v)} - 1 \right)^2 \qquad (5)$$

which is independent of the albedo (cf. [5] for a quotient-based approach for
merely one-dimensional profiles). This error term can easily be extended to $L > 2$
images by computing the $L(L-1)/2$ quotient images from all available image
pairs and summing up the corresponding errors. This method allows separating
brightness changes due to surface shape from those caused by variable albedo.
Similar to SFS with a single image and constant albedo, however, the values for
$q(u,v)$ obtained with this approach are quite uncertain as long as the illumination
vectors are coplanar, so error term (5) should be combined with the shadow-based
approach of Section 2 provided that corresponding shadow information is
available.
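That the quotient-based error does not depend on the albedo can be verified directly. The following sketch is illustrative only: pixels are stored as flat lists, the reflectance values are made-up numbers, and the function names are hypothetical. It also shows the extension to all $L(L-1)/2$ image pairs.

```python
from itertools import combinations

def quotient_error(i1, i2, r1, r2):
    # sum over pixels of (I1 * R2 / (I2 * R1) - 1)^2
    return sum((a * d / (b * c) - 1.0) ** 2 for a, b, c, d in zip(i1, i2, r1, r2))

def multi_image_error(images, reflectances):
    # sum of the pairwise quotient errors over all L(L-1)/2 image pairs
    pairs = combinations(range(len(images)), 2)
    return sum(quotient_error(images[j], images[k], reflectances[j], reflectances[k])
               for j, k in pairs)

albedo = [0.2, 0.9, 0.5]                       # strongly varying albedo
r1 = [0.30, 0.45, 0.60]                        # reflectance under illumination 1
r2 = [0.50, 0.40, 0.35]                        # reflectance under illumination 2
i1 = [a * r for a, r in zip(albedo, r1)]       # observed intensities
i2 = [a * r for a, r in zip(albedo, r2)]

print(quotient_error(i1, i2, r1, r2))          # correct surface: albedo cancels
wrong_r1 = [0.35, 0.45, 0.60]
print(quotient_error(i1, i2, wrong_r1, r2))    # wrong reflectance in one pixel
print(multi_image_error([i1, i2], [r1, r2]))
```

Once the surface gradients are fixed, the albedo follows per pixel as the ratio of observed intensity to modelled reflectance, for example `i1[n] / r1[n]`.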



4 Experimental Results 

Fig. 2 illustrates the performance of the proposed algorithms on a synthetically
generated object (Fig. 2a). The shadow-based technique outlined in Section 2 and
utilizing intensity error term (2) reveals the surface gradients in image v direction
(dashed circles). Traditional single-image SFS as outlined in Section 1.2 is not
able to extract these surface gradients. As suggested in Section 3, the single-image
error term (2) was then replaced by the quotient-based error term (5) for
a reconstruction of the same synthetic object but now with a non-uniform albedo
(Fig. 2c). Consequently, two shading images are used in combination with the
shadow information. As a result, a similar surface profile is obtained, and the
albedo is extracted at an accuracy (root mean square error) of 1.1 percent.

Fig. 2. Surface reconstruction of synthetic images. (a) Ground truth surface profile.
(b) One shading and two shadow images (top, $\mu_{SFS} = 175^\circ$, $\mu^{(1)}_{shadow} = 4.0^\circ$, $\mu^{(2)}_{shadow} = 5.0^\circ$)
along with surface profile (bottom) obtained according to Section 2. (c) Two
shading images and true albedo map (top, $\mu^{(1)}_{SFS} = 15^\circ$, $\mu^{(2)}_{SFS} = 170^\circ$) along with the
reconstructed surface profile and albedo map (bottom) obtained by using the quotient-based
error term (5) instead of single-image error term (2).

For 3D reconstruction of regions on the lunar surface it is possible to use 
a Lambertian reflectance map because for small parts of the lunar surface, the 
relative Lambertian reflectance (an absolute calibration of the images is not 
necessary) differs by only a few percent from those values derived from more 
sophisticated models such as the Lunar-Lambert function [6] - the presented 
framework, however, can also be applied to non-Lambertian reflectance models. 

The CCD images were acquired with ground-based telescopes of 125 mm and 
200 mm aperture. Image scale is 800 m per pixel. 

Fig. 3 shows the reconstructed surface profile of the floor of lunar crater
Theaetetus, generated by the technique outlined in Section 2. Both the simulated
shading image and the shapes of the simulated shadows correspond well with
their real counterparts. Even the ridge crossing the crater floor, which is visible in
the upper left corner of the region of interest in Fig. 1a and in the Lunar Orbiter
photograph in Fig. 3d shown for comparison, is apparent in the reconstructed
surface profile (arrow). Furthermore, it turns out that the crater floor is inclined
from the north to the south, and a very shallow central elevation rising to about
250 m above floor level becomes apparent. This central elevation does not appear
in the images in Fig. 1a used for reconstruction, but is clearly visible in the
ground-based image acquired at higher solar elevation shown in Fig. 3e (left,
lower arrow). The simulated image (right half of Fig. 3e) is very similar to the
corresponding part of the real image although that image has not been used for
reconstruction. This kind of comparison is suggested in [2] as an independent
test of reconstruction quality. For comparison, traditional SFS as outlined in
Section 1.2 yields an essentially flat crater floor and no ridge (Fig. 3f). Here, the
uniform surface albedo was adjusted to yield an SFS solution consistent with
the first shadow image (for details cf. [8]).

Fig. 3. Reconstruction result for the western part of lunar crater Theaetetus (see Fig. 1
for original images). (a) Simulated regions of interest. (b) Reconstructed surface profile.
The z axis is two-fold exaggerated. (c) Reconstructed surface profile with absolute z
values. (d) Lunar Orbiter image IV-110-H2 (not used for reconstruction). (e) Ground-based
image, solar elevation $\mu = 28.7^\circ$, real image (left) and simulated image (right).
This image has not been used for reconstruction. (f) Reconstruction result obtained
with traditional SFS, selecting the solution consistent with the first shadow image.

Fig. 4. Reconstruction result for the region around lunar dome Herodotus ω. (a) Original
images. Solar elevation angles are $\mu_1 = 5.0^\circ$ and $\mu_2 = 15.5^\circ$. (b) Albedo map. (c)
Reconstructed surface profile. The slight overall bending of the surface profile reflects
the Moon's spherical shape.

Fig. 4 shows the region around lunar dome Herodotus ω, obtained with the
quotient-based approach outlined in Section 3. The images are rectified due to
the proximity of this region to the Moon's apparent limb. The reconstructed surface
profile contains several shallow ridges with altitudes of roughly 50 m along
with the lunar dome, whose altitude was determined to be 160 m. The resulting
albedo map displays a gradient in surface brightness from the lower right to the
upper left corner along with several ray structures running radially with respect
to the crater Aristarchus.

5 Summary and Conclusion 

In this paper, shape from shading techniques for 3D reconstruction of surfaces 
under coplanar light sources are proposed. The first presented method relies on 
an initialization of the surface gradients by means of the evaluation of a pixel- 
synchronous set of at least one shading image and two shadow images and yields 
reliable values also for the surface gradients perpendicular to the direction of 
illumination. The second approach is based on at least two pixel-synchronous, 
shadow- free images of the scene acquired under different illumination conditions. 
A shape from shading scheme relying on an error term based on the quotient of 
pixel intensities is introduced which is capable of separating brightness changes 
due to surface shape from those caused by variable albedo. A combination of 
both approaches has been demonstrated. 

In contrast to traditional photometric stereo approaches, both presented
methods can cope with coplanar illumination vectors. They are successfully applied
to synthetically generated data and to the 3D reconstruction of regions
on the lunar surface using ground-based CCD images. The described techniques
should also be suitable for space-based exploration of planetary surfaces. Beyond
the planetary science scenario, they are applicable to classical machine
vision tasks such as surface inspection in the context of industrial quality control.

Abstract. We propose an approach for pose estimation based on a
multi-camera system with known internal camera parameters. We only 
assume for the multi-camera system that the cameras of the system have 
fixed orientations and translations between each other. In contrast to ex- 
isting approaches for reconstruction from multi-camera systems we intro- 
duce a rigid motion estimation for the multi-camera system itself using 
all information of all cameras simultaneously even in the case of non- 
overlapping views of the cameras. Furthermore we introduce a technique 
to estimate the pose parameters of the multi-camera system automati- 
cally. 



1 Introduction 

Robust scene reconstruction and camera pose estimation is still an active re- 
search topic. During the last twelve years many algorithms have been developed, 
initially for scene reconstruction from a freely moving camera with fixed calibration
[2] and later even for scene reconstruction from freely moving uncalibrated
cameras [3]. All these approaches use different self-calibration methods,
which have been developed in the last decade, to estimate the internal calibration
of the camera. This self-calibration can be used to estimate the internal
parameters of multi-camera systems (MCS). However, all these methods still
suffer from ill-conditioned pose estimation problems which cause flat minima
in translation and rotation error functions [4]. Furthermore the relatively small
viewing angle is also a problem which influences the accuracy of the estimation
[4]. Due to these problems we introduce a new pose estimation technique
which combines the information of several rigidly coupled cameras to avoid the 
ambiguities which occur in the single camera case. In our novel approach we es- 
timate a rigid body motion for the MCS as a whole. Our technique combines the 
observations of all cameras to estimate the six degrees of freedom (translation 
and orientation in 3D-space) for the pose of the MCS. It exploits the fixed ro- 
tations and translations between the cameras of the MCS. These fixed rotations 
and translations are denoted as a configuration in the following. We also give a 
technique to determine these parameters automatically from an image sequence 
of the MCS. 

The paper is organized as follows. First we discuss the previous work in pose
estimation from a single camera or an MCS. Afterwards we introduce our novel
pose estimation approach. In section 4 we provide a technique to automatically
estimate the configuration of the MCS. Furthermore in section 5 we show some
experimental results to measure the robustness of our approach.

1.1 Notation 

In this subsection we introduce some notation. The projection of scene points
onto an image by a calibrated camera may be modeled by the equation $x = PX$.
The image point in projective coordinates is $x = [x_x, x_y, x_w]^T$, while $X = [X_x, X_y, X_z, X_w]^T$
is the 3D world point in homogeneous coordinates
and $P$ is the $3 \times 4$ camera projection matrix. The matrix $P$ is a rank-3
matrix. If it can be decomposed as $P = [R^T \mid -R^T C]$, the P-matrix is called
metric, where the rotation matrix $R$ (orientation of the camera) and the translation
vector $C$ (position of the camera) represent the Euclidean transformation
between the camera coordinate system and the world coordinate system.
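The metric projection matrix can be assembled and applied in a few lines. This is a minimal sketch of the notation only, with hypothetical function names, matrices as nested lists, and an arbitrarily chosen example pose.

```python
def projection_matrix(R, C):
    # P = [R^T | -R^T C] for a 3x3 rotation R (list of rows) and camera centre C
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    t = [-sum(Rt[i][j] * C[j] for j in range(3)) for i in range(3)]
    return [Rt[i] + [t[i]] for i in range(3)]

def project(P, X):
    # x = P X for a homogeneous 3D point X = [X_x, X_y, X_z, X_w]
    return [sum(P[i][j] * X[j] for j in range(4)) for i in range(3)]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
centre = [0.0, 0.0, -5.0]                      # camera 5 units behind the origin
P = projection_matrix(identity, centre)

x = project(P, [1.0, 2.0, 0.0, 1.0])
print(x, "->", (x[0] / x[2], x[1] / x[2]))     # homogeneous and Euclidean image point

# the camera centre itself is mapped to the zero vector
print(project(P, centre + [1.0]))
```

The last call reflects that the camera centre spans the null space of the rank-3 matrix $P$.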

2 Previous Work 

For a single moving camera, Fermüller et al. discussed in [4] the ambiguities of
motion estimation in three-dimensional space. They proved that there are
ambiguities in the estimation of translation and rotation for one camera for all
types of estimation algorithms. These ambiguities result in flat minima of the cost
functions. Baker et al. introduced in [5] a technique to avoid these ambiguities
when using a MCS. For each camera the pose is estimated separately and the 
ambiguities are calculated before the fusion of the ambiguous subspaces is used 
to compute a more robust pose of the cameras. In contrast to our approach the 
technique of [5] does not use one pose estimation for all information from all 
cameras simultaneously. There is some work in the area of polydioptric cameras 
[7] which are in fact MCSs with usually very small translations between the 
camera centers. In [8] a hierarchy of cameras and their properties for 3D motion 
estimation is discussed. It can be seen that the pose estimation problem is well- 
conditioned for an MCS in contrast to the ill-conditioned problem for a single 
camera. 

The calibration of a MCS is proposed in [5]. The line-based calibration ap- 
proach is used to estimate the internal and external parameters of the MCS. 
For a MCS with zooming cameras a calibration approach is introduced in [9,10]. 
An approach for an auto-calibration of a stereo camera system is given in [1]. 
Nevertheless, all standard calibration, pose estimation and structure-from-motion
approaches for stereo camera systems exploit the overlapping views of the
cameras, which is in contrast to our pose estimation approach, which does not
depend on this.

3 Pose Estimation for Multi-camera Systems 

In this section we introduce our novel approach for rigid motion estimation
of the MCS. The only assumptions are that we have an MCS with an internal
calibration $K_i$ for each of the cameras and a fixed configuration. That
assumption is valid for most of the currently used MCSs because all these systems
are mounted on some type of carrier with fixed mount points. The computation of
the configuration from the image sequence itself is introduced in section 4. The
internal camera calibration $K_i$ can be determined using the techniques of [10,3].
For convenience we will always talk about K-normalized image coordinates and
P-matrices, therefore $K_i$ can be omitted for pose estimation.

3.1 Relation Between World and Multi-camera System 

The general structure from motion approach uses an arbitrary coordinate system
$C_{world}$ to describe the camera position by the rotation $R$ of the camera,
the position $C$ of the camera center and the reconstructed scene. Normally the
coordinate system $C_{world}$ is equivalent to the coordinate system of the first
camera. In this case the projection matrix of camera $i$ with orientation $R_i$ and
translation $C_i$ is given by

$P_i = [R_i^T \mid -R_i^T C_i]$. (1)

For a multi-camera system we use two coordinate systems during the pose
estimation. The absolute coordinate system $C_{world}$ is used to describe the
positions of 3D points and the pose of the MCS in the world. The second coordinate
system used, $C_{rig}$, is the relative coordinate system of the MCS describing the
relations between the cameras (configuration). It has its origin at $C_v$ and it is
rotated by $R_v$ and scaled isotropically by $\lambda_v$ with respect to $C_{world}$.

Now we discuss the transformations between the different cameras of the
MCS and the transformation into the world coordinate system $C_{world}$. Without
loss of generality we assume all the translations $\Delta C_i$ and rotations $\Delta R_i$ of the
cameras are given in the coordinate system $C_{rig}$. Then with (1) the camera
projection matrix of each camera in $C_{rig}$ is given by

$P_i^{rig} = [\Delta R_i^T \mid -\Delta R_i^T \Delta C_i]$. (2)

The position $C_i$ of camera $i$ and the orientation $R_i$ in $C_{world}$ are given by

$C_i = C_v + \frac{1}{\lambda_v} R_v \Delta C_i, \quad R_i = R_v \Delta R_i$, (3)

where translation $C_v$, orientation $R_v$ and scale $\lambda_v$ are the above described
relations between the MCS coordinate system $C_{rig}$ and the world coordinate system
$C_{world}$. Then the projection matrix of camera $i$ in $C_{world}$ is given by

$P_i = \left[ \Delta R_i^T R_v^T \mid -\Delta R_i^T R_v^T \left( C_v + \frac{1}{\lambda_v} R_v \Delta C_i \right) \right]$. (4)

With (3) we are able to describe each camera's position as a function of the
position and orientation of camera $i$ in the coordinate system of the multi-camera
system $C_{rig}$ and the pose of the MCS in the world $C_{world}$. Furthermore, with (4)
we have the transformation of world points $X$ into the image plane of camera $i$
as a function of the position and orientation of the MCS and the configuration
of the MCS.

3.2 Virtual Camera as a Representation of a Multi-camera System

We now introduce a virtual camera as a representation of the MCS, which is used
to determine the position of the MCS in $C_{world}$ independent of its configuration.

The virtual camera $v$ which represents our MCS is at the origin of the coordinate
system $C_{rig}$ and is not rotated within this system. It follows immediately
that it has position $C_v$ and orientation $R_v$ in $C_{world}$ because it is rotated and
translated in the same manner as the MCS. With (1) the projection matrix $P_v$
of the virtual camera $v$ is

$P_v = [R_v^T \mid -R_v^T C_v]$, (5)

where rotation $R_v$ and position $C_v$ are the above given rotation and position of
the MCS. From (4) and (5) it follows that the projection matrix $P_i$ of camera $i$
depends on the virtual camera's projection matrix $P_v$:

$P_i = \Delta R_i^T \left( P_v - \left[ 0_{3 \times 3} \mid \frac{1}{\lambda_v} \Delta C_i \right] \right)$. (6)
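
The relation between (1), (3) and (6) can be checked numerically. The sketch below is an illustration only: it builds $P_i$ once directly from the world pose of camera $i$ and once through the virtual camera, and compares the two. All helper names and numeric values are assumptions made for this example.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def metric_P(R, C):
    """Eq. (1)/(5): P = [R^T | -R^T C]."""
    return np.hstack([R.T, (-R.T @ C).reshape(3, 1)])

def camera_P_from_virtual(P_v, lam_v, dR_i, dC_i):
    """Eq. (6): P_i = dR_i^T (P_v - [0_{3x3} | dC_i / lam_v])."""
    offset = np.hstack([np.zeros((3, 3)), (dC_i / lam_v).reshape(3, 1)])
    return dR_i.T @ (P_v - offset)

# Made-up MCS pose (virtual camera) and configuration of one camera.
R_v = rot_z(0.3) @ rot_x(-0.2)
C_v = np.array([1.0, -2.0, 0.5])
lam_v = 2.0
dR_i = rot_x(0.7)
dC_i = np.array([0.4, 0.0, -0.1])

# World pose of camera i from eq. (3), then its matrix from eq. (1).
C_i = C_v + R_v @ dC_i / lam_v
R_i = R_v @ dR_i
P_direct = metric_P(R_i, C_i)

# Same matrix obtained through the virtual camera, eq. (5) and (6).
P_via_virtual = camera_P_from_virtual(metric_P(R_v, C_v), lam_v, dR_i, dC_i)
```

Both routes give the same matrix, which is why the pose of the whole MCS can be expressed through the single unknown $P_v$.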



3.3 Pose Estimation of the Virtual Camera 



Now we propose a pose estimation technique for the virtual camera using the
observations of all cameras simultaneously. The image point $x_i$ in camera $i$ of
a given 3D point $X$ is given as $x_i \cong P_i X$, where $x_i \in \mathbb{P}^2$, $X \in \mathbb{P}^3$ and $\cong$ is
the equality up to scale. With equation (6) the image point $x_i$ depends on the
virtual camera's pose by

$x_i \cong P_i X = \Delta R_i^T \left( P_v - \left[ 0_{3 \times 3} \mid \frac{1}{\lambda_v} \Delta C_i \right] \right) X$. (7)

For an MCS with known configuration, namely camera translations $\Delta C_i$, camera
orientations $\Delta R_i$ and scale $\lambda_v$, this can be used to estimate the virtual camera's
position $C_v$ and orientation $R_v$ from the image point $x_i$ in camera $i$
as a projection of the 3D point $X$.

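Now we deduce a formulation for the estimation of the virtual camera's
position $C_v$ and orientation $R_v$ given the translations $\Delta C_i$, orientations $\Delta R_i$, and
scale $\lambda_v$ of the cameras of the MCS. From (7) we get

$\Delta R_i x_i \cong P_v X - \frac{X_w}{\lambda_v} \Delta C_i$,

where $X = [X_x, X_y, X_z, X_w]^T \in \mathbb{P}^3$ is the 3D point in the 3D projective space.
Using the same affine space for both sides, i.e. eliminating the unknown scale,
leads to the following linear equations

$X_x \hat{x}_i^x (P_v)_{3,1} + X_y \hat{x}_i^x (P_v)_{3,2} + X_z \hat{x}_i^x (P_v)_{3,3} + X_w \hat{x}_i^x (P_v)_{3,4}$
$\quad - \left( X_x \hat{x}_i^w (P_v)_{1,1} + X_y \hat{x}_i^w (P_v)_{1,2} + X_z \hat{x}_i^w (P_v)_{1,3} + X_w \hat{x}_i^w (P_v)_{1,4} \right)$
$\quad = (\Delta \hat{C}_i)_3 X_w \hat{x}_i^x - (\Delta \hat{C}_i)_1 X_w \hat{x}_i^w$, (8)

$X_x \hat{x}_i^y (P_v)_{3,1} + X_y \hat{x}_i^y (P_v)_{3,2} + X_z \hat{x}_i^y (P_v)_{3,3} + X_w \hat{x}_i^y (P_v)_{3,4}$
$\quad - \left( X_x \hat{x}_i^w (P_v)_{2,1} + X_y \hat{x}_i^w (P_v)_{2,2} + X_z \hat{x}_i^w (P_v)_{2,3} + X_w \hat{x}_i^w (P_v)_{2,4} \right)$
$\quad = (\Delta \hat{C}_i)_3 X_w \hat{x}_i^y - (\Delta \hat{C}_i)_2 X_w \hat{x}_i^w$ (9)

in the entries of $P_v$, with $\hat{x}_i = \Delta R_i x_i = [\hat{x}_i^x, \hat{x}_i^y, \hat{x}_i^w]^T$ and $\Delta \hat{C}_i = \frac{1}{\lambda_v} \Delta C_i$.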

Note that the above equations are a generalization of the case of a single
camera which can be found in [1], and methods analogous to those given in [1]
can be used to estimate $P_v$ from these equations and to finally extract the
unknown orientation $R_v$ and the unknown position $C_v$. The extension for the
MCS is that the rotation-compensated image points $\hat{x}_i$ are used and terms for
the translation $\Delta C_i$ of camera $i$ in the multi-camera coordinate system $C_{rig}$ are
added. In the case of pose estimation for a single camera using our approach it is
assumed without loss of generality that the coordinate system $C_{rig}$ is equivalent
to the camera's coordinate system. Then $\Delta C_i$ vanishes and the rotation $\Delta R_i$
is the identity. In this case (8) and (9) are the standard (homogeneous) pose
estimation equations from [1].
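
A minimal sketch of how equations (8) and (9) could be stacked into a linear least-squares problem for the twelve entries of $P_v$ is given below. It is an illustration under assumptions, not the authors' implementation: the solver (`numpy.linalg.lstsq`), the SVD-based projection onto a rotation matrix, the function names and the synthetic two-camera rig are all choices made for this example. It also assumes that at least one camera has a non-zero offset $\Delta C_i$, so that the system is inhomogeneous and fixes the scale of $P_v$.

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation by `angle` about `axis`."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    K = np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def estimate_virtual_camera(observations, lam_v=1.0):
    """observations: list of (dR_i, dC_i, x_i, X) with x_i in P^2 and X in P^3."""
    rows, rhs = [], []
    zero = np.zeros(4)
    for dR, dC, x, X in observations:
        xh = dR @ x          # rotation-compensated image point
        dCh = dC / lam_v     # scaled camera offset
        # Eq. (8): x^x (P_v)_3 X - x^w (P_v)_1 X = X_w (dC_3 x^x - dC_1 x^w)
        rows.append(np.concatenate([-xh[2] * X, zero, xh[0] * X]))
        rhs.append(X[3] * (dCh[2] * xh[0] - dCh[0] * xh[2]))
        # Eq. (9): x^y (P_v)_3 X - x^w (P_v)_2 X = X_w (dC_3 x^y - dC_2 x^w)
        rows.append(np.concatenate([zero, -xh[2] * X, xh[1] * X]))
        rhs.append(X[3] * (dCh[2] * xh[1] - dCh[1] * xh[2]))
    p, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    P_v = p.reshape(3, 4)
    # P_v = [R_v^T | -R_v^T C_v]: project the left block onto a rotation.
    U, _, Vt = np.linalg.svd(P_v[:, :3])
    Rt = U @ np.diag([1.0, 1.0, np.linalg.det(U @ Vt)]) @ Vt
    C_v = -np.linalg.solve(P_v[:, :3], P_v[:, 3])
    return Rt.T, C_v, P_v

# Synthetic, noise-free example with a made-up two-camera rig (lambda_v = 1).
rng = np.random.default_rng(0)
R_true = rodrigues([0.2, 1.0, -0.3], 0.4)
C_true = np.array([0.5, -0.2, 1.0])
P_true = np.hstack([R_true.T, (-R_true.T @ C_true).reshape(3, 1)])
rig = [(np.eye(3), np.zeros(3)),
       (rodrigues([0.0, 1.0, 0.0], np.pi / 2), np.array([0.3, 0.0, 0.1]))]
observations = []
for dR, dC in rig:
    for _ in range(10):
        X = np.append(rng.uniform(-2.0, 2.0, size=3) + np.array([0.0, 0.0, 6.0]), 1.0)
        x = dR.T @ (P_true @ X - X[3] * dC)   # eq. (7)
        observations.append((dR, dC, x, X))

R_est, C_est, P_est = estimate_virtual_camera(observations)
```

Note that the two cameras in this example look in perpendicular directions and need not share any scene points; each observation contributes two rows regardless of which camera produced it.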

4 Calibration of the Multi-camera System 

In the previous section we always assumed that we know the orientation $\Delta R_i$
and translation $\Delta C_i$ of each camera in the coordinate system $C_{rig}$ and the scale
$\lambda_v$ between $C_{rig}$ and $C_{world}$. In this section we present a technique to estimate
these parameters from the image sequence of an MCS with overlapping views.
However, note that the simultaneous pose estimation of the MCS itself does not
depend on overlapping views, once the configuration is known. Suppose we are
given $n$ cameras in the MCS and grab images at time $t_0$. After a motion of
the MCS (time $t_1$), we capture the next image of each camera. We now have
$2n$ frames with overlapping views, for which a standard structure-from-motion
approach (for example as described in [6]) for single cameras can be applied to
obtain their positions and orientations.

For each of the two groups of $n$ cameras (the MCS at $t_0$ and $t_1$) the virtual
camera is set to the first camera of the system. Then the rigid transformations
for the other cameras are computed and averaged, which yields an initial
approximate configuration of the system. In order to obtain a mean rotation we use
the axis-angle representation, where axes and angles are averaged arithmetically
with respect to their symmetries. If $C_{world}$ is defined to be the coordinate system
of the estimated single cameras, it follows immediately that $\lambda_v$ has to be set to
1 since $C_{rig}$ already has the correct scale.
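
One possible reading of the axis-angle averaging is sketched below. The paper does not spell out how the symmetries are handled; this sketch assumes that the only symmetry to resolve is that (axis, angle) and (-axis, -angle) describe the same rotation, and aligns all axes with the first one before averaging. The function names are hypothetical, and rotations by exactly 0 or 180 degrees are not handled.

```python
import numpy as np

def axis_angle_to_matrix(axis, angle):
    """Rodrigues formula."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    K = np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def matrix_to_axis_angle(R):
    """Return (unit axis, angle in (0, pi)) of a rotation matrix."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    v = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    n = np.linalg.norm(v)
    if n < 1e-12:
        raise ValueError("axis undefined for angle 0 or pi in this sketch")
    return v / n, angle

def mean_rotation(rotations):
    """Average axes and angles arithmetically after resolving the sign symmetry."""
    axes, angles, ref = [], [], None
    for R in rotations:
        axis, angle = matrix_to_axis_angle(R)
        if ref is None:
            ref = axis
        elif axis @ ref < 0.0:          # same rotation, opposite representation
            axis, angle = -axis, -angle
        axes.append(axis)
        angles.append(angle)
    mean_axis = np.mean(axes, axis=0)
    return axis_angle_to_matrix(mean_axis, float(np.mean(angles)))

# Two estimates of the same relative rotation about the z axis.
z = [0.0, 0.0, 1.0]
R_mean = mean_rotation([axis_angle_to_matrix(z, 0.3), axis_angle_to_matrix(z, -0.1)])
```

Without the sign alignment, the second estimate would enter the average as a rotation of 0.1 about the negative z axis, and the two axes would cancel.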

To improve precision the estimate of the configuration is iteratively refined: 
For each new pose of the system the pose of each single camera is revised with 
respect to the points seen by that camera. Afterwards the configuration of the 
refined cameras is computed and averaged with the previously estimated con- 
figurations. Since the combined camera system pose estimation is somewhat 
sensitive to noise in the configuration parameters, this is more robust. 







Fig. 1. Dependency on the standard deviation of the feature position noise (in pixels)
of (a) the mean of the norm of the camera center error, (b) the standard deviation of
the latter error, (c) the absolute value of the angular error of the camera's orientation,
(d) the standard deviation of the latter error.



Fig. 2. Dependency on the standard deviation of the noise in the MCS configuration
of (a) the mean of the norm of the camera center error, (b) the standard deviation of the
norm of the camera center error, (c) the absolute value of the angular error of the camera's
orientation, (d) the standard deviation of the angular error of the camera's orientation.



5 Experiments 

In this section the introduced estimation techniques for the pose of a MCS are 
evaluated. First we measure the robustness of the technique with synthetic data. 
Afterwards we use image sequences generated by a simulator and compare our 
results with the given ground truth data. Finally we also present experimental 
results for a real image sequence. To measure the noise robustness of our novel 
pose estimation technique we use synthetic data. The MCS is placed in front of 
a scene consisting of 3D-points with given 2D image points in the cameras of 
the MCS. At first we disturb the 2D correspondences with zero-mean Gaussian
noise for each image. Afterwards we use our approach to estimate the pose of the
virtual camera, with a least squares solution based on all observed image points.
The norm of the position error and the angle error of the estimated orientation
are shown in Fig. 1. It can be seen that the proposed pose estimation is
robust with respect to pixel location noise of up to 1 pixel.

In a second test we disturb the configuration of the MCS with a zero-mean 
Gaussian translation error (with sigma of up to 5% of the camera’s original 
displacement) and a Gaussian rotation error of up to 0.35 degrees in each axis. 
It can be seen that the proposed pose estimation technique is robust against these 
disturbances but the configuration errors cause higher errors in the estimated 
pose than the noise in the feature positions does.

Fig. 3. (a),(b): Non-overlapping simulator images of the MCS. (c),(d): Error of relative
translation and rotation since the previous estimate w.r.t. ground truth (17 image pairs)
for standard structure from motion and MCS structure from motion.




Fig. 4. Museum scene: (a) overview image, (b) reconstructed scene points and cameras,
(c) relative corrections of centers in $C_{rig}$, (d) incremental optical axes rotations. The
sequence starts in front of the arc to the left, moves parallel to some wide stairs and
finishes in front of the other arc to the right. 25 times 4 images have been taken.



In order to measure the pose estimation errors of the proposed approach in a
structure-from-motion framework, we use a sequence of rendered images (see Fig.
3) with ground truth pose data. In this sequence an MCS with two fixed cameras
with non-overlapping views is moved and rotated in front of a synthetic scene. We
implemented a pose estimation algorithm with the following steps, given a set of
Harris corners and corresponding 3D points in an initial image: 1.) in each image
a Harris corner detector is used to get feature positions, 2.) from the corners a set
of correspondences is estimated using normalized cross correlation and epipolar
geometry, 3.) using these correspondences (and the corresponding 3D points) the pose
is estimated with RANSAC using eq. (8) and (9), 4.) afterwards a nonlinear
optimization is used to finally determine the pose of the MCS. The measured
position and orientation errors are shown and compared to a single camera pose
estimation in Fig. 3. It can be seen that using the MCS pose estimation the
rotation is estimated with a smaller error than in the single camera case, but
the translation estimate for a single camera is slightly better for this data.

Now we show that the pose estimation also works well on real images. The 
images used have been taken at the National History Museum in London us- 
ing a MCS with four cameras mounted on a pole. The configuration has been 
computed from the image data as described in the previous section. Using 
standard single-camera structure-from-motion approaches, the pose estimation
breaks down in front of the stairs. Due to the missing horizontal structure at
the stairs there are nearly no good features. However, incorporating all cameras
in our approach makes the pose estimation robust exactly in those situations
where some of the cameras can still see some features.

Using our approach to compute the MCS configuration, the initial estimates
for the centers are refined by about five to eight percent in $C_{rig}$ compared to the
finally stable values. After about the seventh pose estimate the center change
rate reaches one percent. It is interesting that although the parameters for the 
second camera are not estimated very well, the system does work robustly as a 
whole. 

6 Conclusions 

We introduced a novel approach for pose estimation of a multi-camera system 
even in the case of non-overlapping views of the cameras. Furthermore we in- 
troduced a technique to estimate all parameters of the system directly from the 
image sequence itself. The new approach was tested under noisy conditions and 
it has been seen that it is robust. Finally we have shown results for real and 
synthetic image sequences.

Abstract. In this contribution we present an algorithm for 2D-3D pose
estimation of human beings. A human torso is modeled in terms of free- 
form surface patches which are extended with joints inside the surface. 
We determine the 3D pose and the angles of the arm joints from image 
silhouettes of the torso. This silhouette based approach towards human 
motion estimation is illustrated by experimental results for monocular 
or stereo image sequences. 



1 Introduction 

Modeling and tracking of human motion from video sequences is an increasingly 
important field of research with applications in sports sciences, medicine, anima- 
tion (avatars) or surveillance. E. Muybridge is known as the pioneer in human 
motion capturing with his famous experiments in 1887 called Animal Locomo- 
tion. In recent years, many techniques for human motion tracking have been 
proposed which are fairly effective [2,5], but they often use simplified models of 
the human body by applying ellipsoidal, cylindrical or skeleton models and do 
not use a realistic surface model. The reader is referred to [4] for a recent survey 
on marker-less human motion tracking. 

In this work we present and discuss a human motion capturing system which 
estimates the pose and angle configuration of a human body captured in image 
sequences. Contrary to other works we apply a 2-parametric surface representa- 
tion [3], allow full perspective camera models, and use the extracted silhouette 
of the body as the only image information. Our algorithms are fast (400ms per 
frame), and we present experiments on monocular and stereo image sequences. 
The scenario is visualized in the left of figure 1. As can be seen, we use a
model of the human torso with its arms, and model it by using two free-form
surface patches. The first patch (modeling the torso) contains 57 x 21 nodes and
the second (modeling the arms) contains 81 x 21 nodes. Each arm contains 4
joints, so that we have to deal with 8 joint angles and 6 unknowns for the rigid
motion resulting in 14 unknowns. The right of figure 1 gives the names of the
used joints for the diagrams in the experiments.

Fig. 1. Left: The pose scenario: the aim is to estimate the pose $R$, $t$ and the joint
angles $\phi_i$. Right: The names of the used joints (shoulder (up), shoulder (back),
elbow, wrist).



This contribution continues work reported in [7,8] on point-based, contour- 
and surface-based pose estimation. But whereas in these works only rigid objects 
are discussed or only a point-based representation scheme for modeling kinematic 
chains is used, in this work we want to overcome these previous limitations by 
introducing an approach for pose estimation of free-form surfaces, coupled with
kinematic chains. It is applied to marker-less human motion tracking. We start 
with recalling foundations and introduce twists which are used to model rigid 
motions and joints. Then we continue with free-form contours and surfaces, 
and define a basic approach for pose estimation, followed by extensions and 
experiments. We conclude with a brief discussion. 



2 Foundations 

Clifford or geometric algebras [9] can be used to deal with geometric aspects 
of the pose problem. We only list a few properties which are important for our 
studies. The elements in geometric algebras are called multivectors which can be 
multiplied by using a geometric product. It allows a coordinate-free and dense 
symbolic representation. For modeling the pose problem, we use the conformal 
geometric algebra (CGA). The CGA is built upon a conformal model which
is coupled with a homogeneous model to deal with kinematics and projective
geometry simultaneously. In conclusion, we deal with the Euclidean, kinematic
and projective space in a uniform framework and can therefore cope with the
pose problem in an efficient manner. In the equations we will use the inner
product, $\cdot$, the outer product, $\wedge$, the commutator product, $\underline{\times}$, and the
anticommutator product, $\overline{\times}$, which can be derived from the geometric product.
Though we will also present equations formulated in conformal geometric algebra,
we only explain these symbolically and refer to [7] for more detailed information.

2.1 Point Based Pose Estimation

For 2D-3D point based pose estimation we use constraint equations which compare
2D image points with 3D object points. Assume an image point $x$ and the
optical center $O$. These define a 3D projection ray, $L_x = e \wedge (O \wedge x)$, as a Plücker
line [6]. The motor $M$ is defined as the exponential of a twist $\Psi$, $M = \exp(-\frac{\theta}{2} \Psi)$,
and formalizes the unknown rigid motion as a screw motion [6]. The motor $M$ is
applied to an object point $X$ as a versor product, $X' = M X \widetilde{M}$, where $\widetilde{M}$
represents the so-called reverse of $M$. Then the rigidly transformed object point,
$X'$, is compared with the reconstructed line, $L_x$, by minimizing the error vector
between the point and the line. This specifies a constraint equation in geometric
algebra:

$(M X \widetilde{M}) \underline{\times} (e \wedge (O \wedge x)) = 0$.

Note that we deal with a 3D formalization of the pose problem. The constraint
equations can be solved by linearization (i.e. solving the equations for the twist
parameters which generate the screw motion) and by applying the Rodrigues
formula for a reconstruction of the group action [6]. Iteration leads to a gradient
descent method in 3D space. This is presented in [7] in more detail, where similar
equations have been introduced to compare 3D points with 2D lines (3D planes)
and 3D lines with 2D lines (3D planes). Pose estimation can be performed in
real-time and we need 2ms on a Linux 2GHz machine to estimate a pose based
on 100 point correspondences.
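
The point-line constraint can be illustrated in ordinary vector algebra. The sketch below is a Euclidean analogue and not the conformal geometric algebra formulation of the paper: the projection ray is stored as a Plücker line (unit direction $d$, moment $m = O \times d$), and the error vector $X' \times d - m$ vanishes exactly when the transformed point $X'$ lies on the ray. Function names and values are assumptions for illustration.

```python
import numpy as np

def projection_ray(O, through):
    """Plücker line (direction d, moment m) through optical center O and a point."""
    d = np.asarray(through, dtype=float) - np.asarray(O, dtype=float)
    d = d / np.linalg.norm(d)
    return d, np.cross(O, d)

def point_line_error(X, line):
    """Error vector X x d - m; its norm is the distance of X from the line."""
    d, m = line
    return np.cross(X, d) - m

# Ray through the origin along the optical axis (made-up numbers).
O = np.zeros(3)
ray = projection_ray(O, [0.0, 0.0, 1.0])

X = np.array([1.0, 0.0, 5.0])                 # object point before the motion
err_before = point_line_error(X, ray)

# A rigid motion (here a pure translation) that moves X onto the ray.
R, t = np.eye(3), np.array([-1.0, 0.0, 0.0])
err_after = point_line_error(R @ X + t, ray)
```

In a pose estimation loop, one such error vector per correspondence would be stacked and minimized over the motion parameters.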

Joints along the kinematic chain can be modeled as special screws with no
pitch. In [7] we have shown that the twist then corresponds to a scaled Plücker
line, $\Psi = \theta L$, in 3D space, which gives the location of the general rotation.
Because of this relation it is simple to move joints in space and they can be
transformed by a motor $M$ in a similar way to plain points, $\Psi' = M \Psi \widetilde{M}$.

2.2 Contour-Based Pose Estimation 

We now model free-form contours and their embedding into the pose problem.
As it turned out, Fourier descriptors are very useful, since they are a special
case of so-called twist-generated curves which we used to model cycloidal curves
(cardioids, nephroids and so forth) within the pose problem [7]. The pose
estimation algorithm for surface models introduced later goes back to a
contour-based method. Therefore, a brief recapitulation of our former works on
contour-based pose estimation is of importance. The main idea is to interpret a
1-parametric 3D closed curve as three separate 1D signals which represent the
projections of the curve along the x, y and z axis, respectively. Since the curve
is assumed to be closed, the signals are periodic and can be analyzed by
applying a 1D discrete Fourier transform (1D-DFT). The inverse discrete Fourier
transform (1D-IDFT) enables us to reconstruct low-pass approximations of each
signal. Subject to the sampling theorem, this leads to the representation of the
1-parametric 3D curve $C(\phi)$ as

$C(\phi) = \sum_{m=1}^{3} \sum_{k=-N}^{N} p_k^m \exp\left( \frac{2 \pi k \phi}{2N+1} l_m \right)$.

The parameter $m$ represents each dimension and the vectors $p_k^m$ are phase
vectors obtained from the 1D-DFT acting on dimension $m$. In this equation we
have replaced the imaginary unit $i = \sqrt{-1}$ by three different rotation planes,
represented by the bivectors $l_m$, with $l_m^2 = -1$. Using only a low-index subset
of the Fourier coefficients results in a low-pass approximation of the object
model which can be used to regularize the pose estimation algorithm. For pose
estimation this model is then combined with a version of an ICP-algorithm [10].
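
The low-pass approximation of a closed contour can be sketched with a standard complex DFT applied to each coordinate separately. This is an illustration only: it uses the ordinary imaginary unit of `numpy.fft` rather than the bivector formulation above, and the function name and the test curve are assumptions.

```python
import numpy as np

def lowpass_closed_curve(points, n_keep):
    """Keep only Fourier coefficients with |k| <= n_keep for each coordinate."""
    points = np.asarray(points, dtype=float)           # shape (S, 3), closed curve
    coeffs = np.fft.fft(points, axis=0)                # 1D-DFT per dimension
    k = np.fft.fftfreq(len(points), d=1.0 / len(points))  # integer frequencies
    coeffs[np.abs(k) > n_keep] = 0.0                   # discard high frequencies
    return np.fft.ifft(coeffs, axis=0).real            # 1D-IDFT per dimension

# Made-up contour: a unit circle with a high-frequency ripple in z.
t = 2.0 * np.pi * np.arange(64) / 64
contour = np.stack([np.cos(t), np.sin(t), 0.1 * np.cos(10.0 * t)], axis=1)
smooth = lowpass_closed_curve(contour, n_keep=1)
```

With `n_keep=1` only the circle survives; raising `n_keep` step by step restores detail, which corresponds to the coarse-to-fine refinement used during pose estimation.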



2.3 Silhouette-Based Pose Estimation of Free-Form Surfaces

To model surfaces, we assume a two-parametric surface [3] of the form

$F(\phi_1, \phi_2) = \sum_{i=1}^{3} f^i(\phi_1, \phi_2) e_i$

with three 2D functions $f^i(\phi_1, \phi_2) : \mathbb{R}^2 \to \mathbb{R}$ acting on the different Euclidean
base vectors $e_i$ ($i = 1, \ldots, 3$). The idea behind a two-parametric surface is to
assume two independent parameters $\phi_1$ and $\phi_2$ to sample a 2D surface in 3D
space. For a discrete number of sampled points, $f^i_{n_1,n_2}$ ($n_1 \in [-N_1, N_1]$; $n_2 \in
[-N_2, N_2]$; $N_1, N_2 \in \mathbb{N}$; $i = 1, \ldots, 3$) on the surface, we can now interpolate
the surface by using a 2D discrete Fourier transform (2D-DFT) and then apply
an inverse 2D discrete Fourier transform (2D-IDFT) for each base vector
separately. Subject to the sampling theorem, the surface can be written as a Fourier
representation,

$F(\phi_1, \phi_2) = \sum_{i=1}^{3} \sum_{k_1=-N_1}^{N_1} \sum_{k_2=-N_2}^{N_2} p^i_{k_1,k_2} \exp\left( \frac{2 \pi k_1 \phi_1}{2N_1+1} l_i \right) \exp\left( \frac{2 \pi k_2 \phi_2}{2N_2+1} l_i \right)$.

The complex Fourier coefficients are contained in the vectors $p^i_{k_1,k_2}$ that lie in
the plane spanned by $l_i$. We will again call them phase vectors. These vectors
can be obtained by a 2D-DFT of the sample points $f^i_{n_1,n_2}$ on the surface. We
now continue with the algorithm for silhouette-based pose estimation of surface
models.



Surface based pose estimation:
  - Reconstruct projection rays from image points
  - Project the low-pass object model in the virtual image
  - Estimate the 3D silhouette
  - Apply contour based pose estimation algorithm:
      - Estimate the nearest point on the 3D contour to each ray
      - Use the correspondence set to estimate the contour pose
      - Transform the contour model
  - Transform the surface model
  - Increase the low-pass approximation of the surface model

Fig. 2. Left: The algorithm for pose estimation of surface models. Right: A few example
images of a tracked car model on a turn-table.

Fig. 3. The basic algorithm: Iterative correspondence and pose estimation (tracking
assumption, correspondence estimation, pose estimation, iteration).

We assume a properly extracted silhouette (i.e., in a frame of the sequence) 
of our object (i.e., the human body). To compare points on the image silhouette 
we consider rim points on the surface model (i.e., which are on an occluding 
boundary of the object). This means we work with the 3D silhouette of the 
surface model with respect to the camera. To obtain this, we project the 3D 
surface on a virtual image. Then the contour is calculated and from the image 
contour the 3D silhouette of the surface model is reconstructed. The contour 
model is then applied within the contour-based pose estimation algorithm. Since 
aspects of the surface model are changing during ICP-cycles, a new silhouette 
will be estimated after each cycle to deal with occlusions within the surface 
model. The algorithm for pose estimation of surface models is summarized in 
figure 2 and it is discussed in [8] in more detail. 

3 Human Motion Estimation 

We now introduce how to couple kinematic chains within the surface model 
and present a pose estimation algorithm which estimates the pose and angle 
configurations simultaneously. 

A surface is given in terms of three 2-parametric functions with respect to
the parameters $\phi_1$ and $\phi_2$. Furthermore, we assume a set of joints $J_i$. By using
an extra function $\mathcal{J}(\phi_1, \phi_2) \to [J_i \mid J_i : i\text{th joint}]$, we are able to give every node
a joint list along the kinematic chain. Note that we use [, ] and not {, }, since the
joints are given ordered along the kinematic chain. Since the arms contain two
kinematic chains (for the left and right arm separately), we introduce a further
index to separate the joints on the left arm from the ones on the right arm. The
joints themselves are represented as objects in an extra field (a look-up table)
and their parameters can be accessed immediately from the joint index numbers.
Furthermore, it is possible to transform the location of the joints in space (as
clarified in section 2). For pose estimation of a point $X_{n,i_n}$ attached to the $n$th
joint, we generate constraint equations of the form

$(M (M_1 \ldots M_n X_{n,i_n} \widetilde{M}_n \ldots \widetilde{M}_1) \widetilde{M}) \underline{\times} (e \wedge (O \wedge x_{n,i_n})) = 0$.

To solve a set of such constraint equations we linearize the motor $M$ with
respect to the unknown twist $\Psi$ and the motors $M_i$ with respect to the unknown
angles $\theta_i$. The twists $\Psi_i$ are known a priori.
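
The nesting of the joint motors inside the rigid motion can be illustrated with rotation matrices instead of motors. The sketch below is a Euclidean analogue under assumptions: every joint is a pitch-free rotation about a known line (a point on the axis and an axis direction, given in the rest pose), the joint closest to the point is applied first, and the rigid body motion $(R, t)$ is applied last. Function names and the planar two-joint arm are made up for this example.

```python
import numpy as np

def rodrigues(axis, angle):
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    K = np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def rotate_about_line(X, q, axis, angle):
    """Rotate X by `angle` about the line through q with direction `axis`."""
    return q + rodrigues(axis, angle) @ (X - q)

def transform_chain_point(X, joints, angles, R, t):
    """joints: [(q_1, axis_1), ..., (q_n, axis_n)] ordered from the root outwards."""
    X = np.asarray(X, dtype=float)
    # Innermost motor M_n acts first, M_1 last, then the rigid motion M.
    for (q, axis), angle in zip(reversed(joints), reversed(angles)):
        X = rotate_about_line(X, np.asarray(q, dtype=float), axis, angle)
    return R @ X + t

# Planar two-joint arm: shoulder at the origin, elbow at (1, 0, 0), hand at (2, 0, 0).
joints = [([0.0, 0.0, 0.0], [0.0, 0.0, 1.0]),
          ([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])]
hand = transform_chain_point([2.0, 0.0, 0.0], joints, [np.pi / 2, np.pi / 2],
                             np.eye(3), np.zeros(3))
```

Because the joint axes are specified in the rest pose, applying the outer joints afterwards also carries the inner joint axes along, which is the effect of the nested versor products in the constraint equation.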
Fig. 4. Left: First pose results with a 6 DOF kinematic chain. Right: Angles of the left
and right arm during the tracked image sequence.

Fig. 5. First pose results with an 8 DOF kinematic chain.



The basic pose estimation algorithm is visualized in figure 3: We start with 
simple image processing steps to gain the silhouette information of the person 
by using a color threshold and a Laplace operator. Then we project the surface 
mesh in a virtual image and estimate its 3D contour. Each point on the 3D 
contour carries a given joint index. Then we estimate the correspondences by 
using an ICP-algorithm, generate the system of equations, solve them, transform 
the object and its joints and iterate this procedure. During iteration we start 
with a low-pass object representation and refine it by using higher frequencies. 
This helps to avoid local minima during iteration. 
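
The correspondence step of such an iteration can be sketched as a brute-force nearest-neighbour search between projection rays and the points of the 3D contour. This is a simplified illustration, not the authors' ICP variant: rays are treated as infinite lines, all distances are computed explicitly, and the names and values are assumptions.

```python
import numpy as np

def nearest_contour_points(contour, ray_origins, ray_dirs):
    """For each ray return the index of the contour point closest to it."""
    contour = np.asarray(contour, dtype=float)          # (P, 3)
    origins = np.asarray(ray_origins, dtype=float)      # (R, 3)
    dirs = np.asarray(ray_dirs, dtype=float)            # (R, 3)
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    diff = contour[None, :, :] - origins[:, None, :]    # (R, P, 3)
    # Point-to-line distance: |(X - O) x d| for unit direction d.
    dist = np.linalg.norm(np.cross(diff, dirs[:, None, :]), axis=2)
    return np.argmin(dist, axis=1)

# Made-up example: three contour points and three rays from one optical center.
contour = np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 5.0], [0.0, 2.0, 5.0]])
origins = np.zeros((3, 3))
dirs = np.array([[0.0, 0.0, 1.0], [0.2, 0.0, 1.0], [0.0, 0.4, 1.0]])
matches = nearest_contour_points(contour, origins, dirs)
```

The resulting index pairs form the correspondence set from which the constraint equations of one iteration are generated; after the pose update the search is repeated.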

First results of the algorithm are shown on the left of figure 4: The figure
contains two pose results; in each quadrant it shows the original image
overlaid with the projected 3D pose. The other two images show the estimated joint
angles in a virtual environment to visualize the error between the ground truth
and the estimated pose. The tracked image sequence contains 200 images. In
this sequence we use just three joints on each arm and neglect the shoulder
(back) joint. The right diagram of figure 4 shows the estimated angles of the
joints during the image sequence. The angles can easily be identified with the
sequence. Since the movement of the body is continuous, the estimated curves
are also relatively smooth.

Fig. 6. The silhouette for different arm poses of the kinematic chain.

Fig. 7. Example images of a first stereo experiment.

Then we extend the model to an 8 DOF kinematic chain and add a joint on
the shoulder which allows the arms to move backwards and forwards. Results
of the same sequence are shown in figure 5. As can be seen, the pose
overlaid with the image data appears to be good, but in a simulation
environment it can be seen that the estimated joints are quite noisy. The
reason for the depth sensitivity lies in the used image information: Figure 6
shows two images of a human with different arm positions. It can be seen that
the estimated silhouettes look quite similar. This means that the used image
features are under-determined in their interpretation as 3D pose configuration.
This problem cannot be solved in an algorithmic way and is of geometric nature.
To overcome this problem we decided to continue with a stereo setup. The basic
idea is that the geometric non-uniqueness can be avoided by using several
cameras observing the scene from different perspectives. Since we reconstruct
rays from image points, we have to calibrate the cameras with respect to one
fixed world coordinate system. Then it is unimportant for which camera a ray is
reconstructed and we are able to combine the equations from both cameras into
one system of equations and estimate the pose and arm angles simultaneously.
Figure 7 shows example images of our stereo implementation. In each segment,
the left images show the original and filtered image of each camera. The middle
images show pose results in both cameras and the right images show the pose
results in a virtual environment. It can be seen that the results improved.

4 Discussion

This contribution presents an approach for silhouette-based pose estimation of 
free-form surfaces coupled with kinematic chains. We use our previous work, 
dealing with 2D-3D pose estimation of points, free-form contours and free-form 
surfaces and describe how to extend the approach to kinematic chains. In the 
experiments it turns out that pure silhouette information is not sufficient for 
accurate pose estimation since the extracted silhouette and its interpretation as 
3D pose is under-determined. Therefore, we move on to a multi-view scenario 
and illustrate that the pose results can be improved remarkably in a stereo setup. 
Experiments have been done with image sequences between 100 and 500 frames. 
Further work will continue with an extension of the multi-camera set-up. 

Acknowledgments. This work has been supported by the EC Grant IST-2001- 
3422 (VISATEC) and by the DFG grant RO 2497/1-1. 
Abstract. This paper presents a cooperative optimization algorithm
for energy minimization in a general form. Its operations are based on 
parallel, local iterative interactions. This algorithm has many important 
computational properties absent in existing optimization methods. Given 
an optimization problem instance, the computation always has a unique 
equilibrium and converges to it with an exponential rate regardless of 
initial conditions. There are sufficient conditions for identifying global 
optima and necessary conditions for trimming search spaces. To demon- 
strate its power, a case study of stereo matching from computer vision 
is provided. The proposed algorithm does not have the restrictions on 
energy functions imposed by graph cuts [1,2], a powerful specialized opti- 
mization technique, yet its performance was comparable with graph cuts 
in solving stereo matching using the common evaluation framework [3]. 



1 Introduction 

Stereo matching is one of the most active research areas in computer vision [3, 
1,4,5]. The goal of stereo matching is to recover the depth image of a scene 
from a pair of 2-D images of the same scene taken from two different locations. 
Like many other problems from computer vision, it can be formulated as the 
global optimization of multivariate energy functions, which is NP-hard [6] in
computational complexity. 

The general methods [7] for combinatorial optimization are 1) local search [7],
2) simulated annealing [8], 3) genetic algorithms [9], 4) tabu search,
5) branch-and-bound [10,11], and 6) dynamic programming [11]. The first four
methods are classified as local optimization and thus suffer from the local
optimum problem. The remaining two methods do not scale well when dealing with
the thousands to millions of variables found in most vision problems.

On the specialized optimization algorithm side, if the energy function has
some special form, such as having binary variables and binary smoothness
constraints [1,2], the energy minimization can be converted into the problem of
finding the minimum cut in a graph, for which polynomial-time algorithms are
known. Those algorithms have also been generalized [1,2] as approximate
algorithms for energy functions with multi-valued variables and regular
constraints [2]. They are called the graph cut algorithms in [1,3,2], a powerful
specialized optimization technique popular in computer vision. They have the
best known results in energy minimization in the two recent evaluations of
stereo algorithms [3,12]. However, the convergence properties of those
algorithms, such as the convergence rate and the uniqueness of solutions, are
not established.

The cooperative algorithm presented in this paper is a general optimization
technique. It does not share the restrictions on energy functions imposed by
graph cuts. It also does not have the local minimum problem and can handle
problems with tens or hundreds of thousands of variables in practice. The
proposed algorithm has a clearly defined energy function in a general form. Its
operations are based on parallel, local iterative interactions. Hence, it can be
implemented using a one-layered recurrent neural network.

Unlike many existing optimization techniques, our algorithm has a solid
theoretical foundation, including convergence properties. With the new set of
difference equations, it guarantees both the existence and the uniqueness of
solutions as well as an exponential convergence rate. It can determine whether
a solution is the global optimum and assess the quality of solutions. It has
been generalized using the lattice concept from abstract algebra to be more
powerful and complete [13]. It has also been generalized to cover the classic
local search as a special case by extending its cooperation schemes [14]. To
demonstrate its power, we will show in this paper that the proposed algorithm
has a performance comparable with graph cuts in handling regular binary
constraints using the common evaluation framework for stereo matching [3].

2 The Cooperative Optimization Algorithm 

To solve a hard combinatorial optimization problem, we follow the divide-and- 
conquer principle. We first break up the problem into a number of sub-problems 
of manageable sizes and complexities. Following that, we solve them together 
in a cooperative way so that the original energy function is minimized. The 
cooperation is achieved by asking each sub-problem solver, termed agent, to 
compromise its solution with the solutions of other sub-problem solvers. Hence, 
the algorithm uses a system of multi-agents working together cooperatively to 
solve an optimization problem. 

To be more specific, let $E(x_1, x_2, \ldots, x_n)$ be a multivariate energy
function, or simply denoted as $E(x)$, where each variable $x_i$ has a finite
domain $D_i$ of size $m_i$ ($m_i = |D_i|$). We break the function into $n$
sub-energy functions $E_i$ ($i = 1, 2, \ldots, n$), such that $E_i$ contains at
least variable $x_i$ for each $i$, the minimization of each energy function
$E_i$ (the sub-problem) is computationally manageable in size and complexity,
and $\sum_i E_i = E(x)$.

For example, the binary constraint-based function

$$E(x_1, x_2, \ldots, x_n) = \sum_i C_i(x_i) + \sum_{i,j,\, i \neq j} C_{ij}(x_i, x_j), \tag{1}$$

is a very popular energy function used in computer vision, where $C_i$ is a
unary constraint on variable $x_i$ and $C_{ij}$ is a binary constraint on
variables $x_i$ and $x_j$. A straightforward decomposition of this energy
function is:

$$E_i = C_i(x_i) + \sum_{j,\, j \neq i} C_{ij}(x_i, x_j) \quad \text{for } i = 1, 2, \ldots, n. \tag{2}$$

The $n$ sub-problems can be described as:

$$\min_{x_j \in X_i} E_i, \quad \text{for } i = 1, 2, \ldots, n, \tag{3}$$

where $X_i$ is the set of variables that sub-energy function $E_i$ contains.
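
The decomposition in (2) can be checked numerically on a small instance. The following is a minimal sketch, assuming a toy problem with three binary variables on a chain and a Potts-style binary constraint; the names `total_energy`, `sub_energy`, `unary`, and `binary` are illustrative and not taken from the paper. Binary constraints are stored per ordered pair (i, j), so each sub-energy collects the pairs whose first index is i.

```python
import itertools

def total_energy(x, unary, binary):
    """E(x): all unary terms plus all binary terms over ordered pairs (i, j)."""
    e = sum(unary[i][x[i]] for i in range(len(x)))
    e += sum(c[x[i]][x[j]] for (i, j), c in binary.items())
    return e

def sub_energy(i, x, unary, binary):
    """E_i: the unary term of variable i plus the binary terms whose first index is i."""
    e = unary[i][x[i]]
    e += sum(c[x[a]][x[b]] for (a, b), c in binary.items() if a == i)
    return e

# Hypothetical toy instance: three binary variables on a chain.
unary = [[0.0, 1.0], [0.5, 0.2], [1.0, 0.0]]
potts = [[0.0, 1.0], [1.0, 0.0]]
binary = {(0, 1): potts, (1, 0): potts, (1, 2): potts, (2, 1): potts}

# Largest gap between sum_i E_i and E over all assignments (should be ~0).
max_gap = max(
    abs(sum(sub_energy(i, x, unary, binary) for i in range(3))
        - total_energy(x, unary, binary))
    for x in itertools.product(range(2), repeat=3)
)
```

Other decompositions that also sum to $E(x)$ are possible; this one simply mirrors Eq. (2).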

Because of the interdependence of the sub-energy functions, as in the case of
the binary constraint-based function (see Eq. (1)), minimizing those sub-energy
functions independently can hardly yield a consensus in variable
assignments. For example, the assignment for $x_i$ that minimizes $E_i$ can
hardly be the same as the assignment for the same variable that minimizes $E_j$
if $E_j$ contains $x_i$. We need to solve those sub-problems in a cooperative
way so that we can reach a consensus in variable assignments.

To do that, we can break the minimization of each sub-energy function (see
(3)) into two steps,

$$\min_{x_i} \; \min_{x_j \in X_i \setminus x_i} E_i, \quad \text{for } i = 1, 2, \ldots, n,$$

where $X_i \setminus x_i$ denotes the set $X_i$ minus $\{x_i\}$.

That is, first we optimize $E_i$ with respect to all variables that $E_i$
contains except $x_i$. This gives us the intermediate solution in optimizing
$E_i$, denoted as $c_i(x_i)$,

$$c_i(x_i) = \min_{x_j \in X_i \setminus x_i} E_i \quad \text{for } i = 1, 2, \ldots, n. \tag{4}$$

Second, we optimize $c_i(x_i)$ with respect to $x_i$,

$$\min_{x_i} c_i(x_i). \tag{5}$$

The intermediate solution of the optimization, $c_i(x_i)$, is a unary constraint
introduced by the algorithm on the variable $x_i$, called the assignment
constraint on variable $x_i$. Given a value of $x_i$, $c_i(x_i)$ is the minimal
value of $E_i$. To minimize $E_i$, those values of $x_i$ which have smaller
assignment constraint values $c_i(x_i)$ are preferred over those which have
higher ones.
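
The two-step minimization in (4) and (5) can be illustrated by brute force on a tiny sub-problem. This is only a sketch under assumed inputs: the sub-energy `E1`, the unary costs `C1`, and the three-value domains are hypothetical, and exhaustive enumeration is used purely for clarity.

```python
import itertools

def assignment_constraint(i, scope, energy_i, domains):
    """c_i(x_i): for each value of x_i, minimise E_i over the other variables in scope (Eq. 4)."""
    others = [j for j in scope if j != i]
    c = {}
    for v in domains[i]:
        best = float("inf")
        for combo in itertools.product(*(domains[j] for j in others)):
            assign = dict(zip(others, combo))
            assign[i] = v
            best = min(best, energy_i(assign))
        c[v] = best
    return c

# Hypothetical sub-problem: E_1 = C_1(x_1) + |x_1 - x_0| + |x_1 - x_2|.
domains = {0: [0, 1, 2], 1: [0, 1, 2], 2: [0, 1, 2]}
C1 = {0: 0.3, 1: 0.0, 2: 0.7}

def E1(a):
    return C1[a[1]] + abs(a[1] - a[0]) + abs(a[1] - a[2])

c1 = assignment_constraint(1, [0, 1, 2], E1, domains)  # step 1, Eq. (4)
best_value = min(c1, key=c1.get)                        # step 2, Eq. (5)
```

Here the neighbours can always match $x_1$, so the assignment constraint reduces to the unary cost and the preferred value is the one with the smallest $c_1$.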

To introduce cooperation in solving the sub-problems, we add the unary
constraints $c_j(x_j)$, weighted by a real value $\lambda$, back to the right
side of (4) and modify the functions (4) to be iterative ones:

$$c_i^{(k)}(x_i) = \min_{x_j \in X_i \setminus x_i} \Big( (1 - \lambda_k) E_i + \lambda_k \sum_j w_{ij}\, c_j^{(k-1)}(x_j) \Big) \quad \text{for } i = 1, 2, \ldots, n, \tag{6}$$

where $k$ is the iteration step and $w_{ij}$ are non-negative weight values
satisfying $\sum_i w_{ij} = 1$. It has been found [13] that such a choice of
$w_{ij}$ ensures that the iterative update functions converge. The energy
function on the right side of the equation is called the modified sub-energy
function, denoted as $\tilde{E}_i$.

By adding back $c_j(x_j)$ to $E_i$, we ask the optimization of $E_i$ to
compromise its solution with the solutions of the other sub-problems. As a
consequence, cooperation in the optimization of all the sub-energy functions
($E_i$s) is achieved. This optimization process defined in (6) is called the
cooperative optimization of the sub-problems.

Parameter $\lambda_k$ in (6) controls the level of the cooperation at step $k$
and is called the cooperation strength, satisfying $0 \le \lambda_k < 1$. A
higher value for $\lambda_k$ in (6) will weigh the solutions of the other
sub-problems $c_j(x_j)$ more than the one of the current sub-problem $E_i$. In
other words, the solution of each sub-problem will compromise more with the
solutions of other sub-problems. As a consequence, a higher level of
cooperation in the optimization is reached in this case.

The update functions (6) are a set of difference equations of the assignment
constraints $c_i(x_i)$. Unlike conventional difference equations used by
probabilistic relaxation algorithms [15], cooperative computations [5], and
Hopfield networks [16], this set of difference equations always has one and
only one equilibrium given $\lambda$ and $w_{ij}$. The computation converges to
the equilibrium with an exponential rate, $\lambda$, regardless of the initial
conditions $c_i^{(0)}(x_i)$. Those computational properties will be shown in
theorems in the next section and their proofs are provided in [14].
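
The update (6) and its independence from initial conditions can be sketched for pairwise sub-energies, where the inner minimization decouples per neighbour. Everything below is an illustrative assumption rather than the author's implementation: a ring of six sites with two neighbours each, weights $w_{ij} = 0.5$, a truncated linear binary constraint, random unary costs, and a fixed cooperation strength of 0.5 (the paper varies $\lambda_k$ over iterations).

```python
import numpy as np

def cooperative_step(c_prev, unary, pair, neighbors, w, lam):
    """One sweep of update (6) for E_i = C_i(x_i) + sum_j C_ij(x_i, x_j)."""
    n, m = unary.shape
    c_new = np.empty_like(c_prev)
    for i in range(n):
        total = (1.0 - lam) * unary[i]
        for j in neighbors[i]:
            # Rows index x_i, columns index x_j; minimise over x_j.
            cost = (1.0 - lam) * pair + lam * w[i, j] * c_prev[j][None, :]
            total = total + cost.min(axis=1)
        c_new[i] = total
    return c_new

def run(c0, unary, pair, neighbors, w, lam, steps):
    c = c0.copy()
    for _ in range(steps):
        c = cooperative_step(c, unary, pair, neighbors, w, lam)
    return c

# Hypothetical toy problem: 6 sites on a ring, 4 labels per site.
rng = np.random.default_rng(0)
n, m = 6, 4
unary = rng.random((n, m))
labels = np.arange(m)
pair = np.minimum(np.abs(labels[:, None] - labels[None, :]), 2).astype(float)
neighbors = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
w = np.zeros((n, n))
for i in range(n):
    for j in neighbors[i]:
        w[i, j] = 0.5  # two neighbours per site, so the weights sum to 1

# Two very different initial conditions.
c_a = run(np.zeros((n, m)), unary, pair, neighbors, w, lam=0.5, steps=60)
c_b = run(100.0 * rng.random((n, m)), unary, pair, neighbors, w, lam=0.5, steps=60)
```

In this setting each sweep shrinks the largest difference between two runs by at least the factor `lam`, so `c_a` and `c_b` should agree to machine precision after 60 sweeps.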

By minimizing the linear combination of $E_i$ and $c_j(x_j)$, which are the
intermediate solutions for other sub-problems, we can reasonably expect that a
consensus in variable assignments can be reached. When the cooperation is
strong enough, i.e., $\lambda_k \to 1$, the difference equations (6) are
dominated by the assignment constraints $c_j(x_j)$, and it appears that the
only choice for $x_j$ to minimize the right side of (6) is the one that has the
minimal value of the assignment constraint $c_j(x_j)$ for any $E_i$ that
contains $x_j$. That is a consensus in variable assignments.

Theory only guarantees the convergence of the computation to the unique
equilibrium of the difference equations. If it converges to a consensus
equilibrium, the solution, which consists of the consensus assignments for the
variables, must be the global optimum of the energy function $E(x)$, as
guaranteed by theory (details in the next section). However, theory does not
guarantee that the equilibrium is a consensus, even when the cooperation
strength $\lambda$ is increased. Otherwise, we would have NP = P.

In addition to the cooperation scheme for reaching a consensus in variable
assignments, we introduce another important operation of the algorithm at each
iteration, called variable value discarding. A certain value for a variable,
say $x_i$, can be discarded if it has an assignment constraint value $c_i(x_i)$
that is higher than a certain threshold, $c_i(x_i) > t_i$, because such values
are less preferable in minimizing $E_i$, as explained before. Thresholds for
doing so do exist in theory [13]. The discarded values are those that cannot be
in any global optimal solution. By discarding values, we can trim the search
space. If only one value is left for each variable after a certain number of
iterations using the thresholds provided by theory, they constitute the global
optimal solution, guaranteed by theory [13]. However, theory does not guarantee
that one value is left for each variable in all cases. Otherwise, we would have
NP = P.

By discarding values, we increase the chance of reaching a consensus equilib- 
rium for the computation. In practice, we progressively tighten the thresholds to 
discard more and more values as the iteration proceeds to increase the chance 
of reaching a consensus equilibrium. In the end, we leave only one value for each 
variable. Then, the final solution is a consensus equilibrium. 

However, by doing that, such a final solution is not guaranteed to be the 
global optimum. Nevertheless, in our experiments in solving large scale combina- 
torial optimization problems, we found that the solution quality of this algorithm 
is still satisfactory, much better than that of other conventional optimization 
methods, such as simulated annealing and local search [13]. 

3 Experiments and Results 

The proposed cooperative algorithm serves as a general problem solver in a
unified computational model for understanding and solving all kinds of vision
tasks. In general, it can find correct solutions for any vision task. It was
found that any vision task (object recognition, shape from x, image
segmentation, and more) can be represented as a mapping from an input space to
an output space (e.g., from stimulus to interpretation). If there is no
ambiguity in the mapping, or in other words there is a unique interpretation
given an input set, theory guarantees the existence of a constraint model
defining the mapping. The constraint calculus, offering a powerful knowledge
representation framework for defining the constraint model, is a general
extension of the tuple relational calculus, where Boolean computing is extended
to soft computing. The tuple relational calculus together with functions has
the same expressive power as the Turing machine. Furthermore, the cooperative
algorithm is guaranteed in theory to find the correct solutions when the noise
level introduced at the input is limited (details to be presented at Operations
Research 2004). In this paper, we offer the case study of stereo matching.

For details about the energy function definitions used for stereo matching in
the framework, please see [3]. Basically, a unary constraint $C_i(x_i)$ in (1)
measures the difference of the intensities between site $i$ from one image and
its corresponding site in the other image given the depth of the site. A binary
constraint $C_{ij}(x_i, x_j)$ measures the difference of the depths between
site $i$ and site $j$. This type of constraint is also referred to as the
smoothness constraint in the literature. It is also widely used in solving
image segmentation and other vision tasks.

From an information-theoretical point of view, a binary constraint contains
less information than a high-arity constraint for solving vision problems.
Hence, high-arity constraints lead to better solution quality. Most of the
literature deals only with the binary smoothness constraint because of the
limitations of many conventional optimization techniques. The algorithm
proposed in this paper does not have such a restriction. It was found that,
when the smoothness constraint is replaced by 9-ary constraints, the algorithm
considerably outperforms graph cuts in solving image segmentation. It was ten
times faster and the error rate was reduced by a factor of two to three.

Using the constraint model, the only difference between image segmentation
and stereo vision is the content of the unary constraints. Although we only
present the experimental results using the smoothness constraint in this paper,
the very promising results of applying high-arity constraints in image
segmentation encourage us to use them for stereo matching in future work.

To choose the propagation matrix, we have all the options as long as
the matrix is square, irreducible, and has non-negative elements as defined in
Definition [13]. Since each site $i$ in an image has four neighbors and is
associated with one agent, we set $w_{ij}$ to be nonzero (= 0.25) if and only
if site $j$ is a neighbor of site $i$.
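
A propagation matrix of this kind can be built and checked in a few lines. The sketch below assumes periodic (wrap-around) image borders so that every site really has exactly four neighbours; the paper does not say how border sites are handled, and the function name `propagation_matrix` is illustrative.

```python
import numpy as np

def propagation_matrix(h, w):
    """w_ij = 0.25 iff site j is a 4-neighbour of site i (periodic borders assumed)."""
    n = h * w
    W = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                j = ((r + dr) % h) * w + (c + dc) % w
                W[i, j] = 0.25
    return W

W = propagation_matrix(3, 4)
n = W.shape[0]

col_sums = W.sum(axis=0)  # the normalisation sum_i w_ij
# Irreducibility check: every site reaches every other site in at most n - 1 hops.
reach = np.linalg.matrix_power(np.eye(n) + W, n - 1)
irreducible = bool((reach > 0).all())
```

With non-periodic borders, edge and corner sites have fewer neighbours and the weights would need to be renormalised to keep the sums equal to one.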

It was found in the experiments that different choices of the propagation
matrix have much less effect on the quality of solutions than on the
convergence rate. This is reasonable because the propagation process, defined
in the algorithm by using the propagation matrix (6), is guaranteed to spread
local information $E_i$ uniformly across all agents with any choice of the
matrix. Because of this uniform spreading of local information, each agent can
make better decisions in solving sub-problems cooperatively. Such a propagation
process makes the algorithm different from the conventional optimization
methods. This is another explanation for why the algorithm has the capability
of finding global optima.

In the iterative function (6), the parameter $\lambda_k$ is updated as

$$\lambda_k = (k - 1)/k, \quad \text{where } k \geq 1.$$

Hence, the cooperation becomes stronger as the iteration proceeds.

In the experiments, we discard variable values using thresholds. That is, a
value $x_i$ is discarded if

$$c_i^{(k)}(x_i) > \min_{x_i} c_i^{(k)}(x_i) + t^{(k)},$$

where the initial threshold is 100, and $t^{(k)} = 0.92 \cdot t^{(k-1)}$ if
fewer than 0.1% of the values are discarded at the current iteration.
Otherwise, $t^{(k)}$ remains unchanged. Therefore, those thresholds become
tighter and tighter as the iteration proceeds, and more and more values are
discarded for each variable. Eventually, there should be only one value left
for each one. However, by using those simple thresholds, the final solution is
not guaranteed to be the global optimum because those thresholds are not those
suggested by theory.
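
The two schedules described above are simple enough to write down directly. This is a minimal sketch: the helper names are hypothetical, and the example values in `c_i` are made up to show which values survive a threshold of 100.

```python
import numpy as np

def cooperation_strength(k):
    """lambda_k = (k - 1) / k for iteration step k = 1, 2, ..."""
    return (k - 1) / k

def keep_mask(c_i, t):
    """True where a value survives, i.e. c_i(x_i) <= min c_i + t."""
    c_i = np.asarray(c_i, dtype=float)
    return c_i <= c_i.min() + t

def next_threshold(t_prev, discarded_fraction, rate=0.92, min_fraction=0.001):
    """Tighten the threshold only if fewer than 0.1% of values were discarded."""
    return rate * t_prev if discarded_fraction < min_fraction else t_prev

# Hypothetical assignment constraint values for one variable.
c_i = [3.0, 150.0, 40.0, 103.5]
mask = keep_mask(c_i, t=100.0)  # values above 3.0 + 100.0 are discarded
```

In a full implementation the mask would be carried over between iterations so that a discarded value never re-enters the search space.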

It was found through theoretical investigations that the tightening rate of 
the thresholds depends on how deep the valley containing the global optimum in 
the search space is. If it is very close to other valleys, we need a slow tightening 
rate. Otherwise, a fast tightening rate is required. 

Using the four test image pairs from the framework, the cooperative
optimization is marginally better (1.85%) than graph cuts in overall disparity
error. For the Tsukuba image pair, which is close to real images from stereo
matching, the cooperative optimization is 5.93% better than graph cuts (see Fig. 1).




308 X. Huang 



Excluding occluded areas, which are not handled by either algorithm, the
cooperative one is also the best for all other types of areas. An occluded area
is one that is visible in one image, but not the other.




Fig. 1. The depth images recovered by our algorithm (left) and graph cuts (right). 



The following four tables show the performance of the cooperative algorithm
(upper rows in a table) and the graph cut algorithm (lower rows in a table) over
the four test image sets. The performance is measured on all areas, non-occluded
areas, occluded areas, textured areas, texture-less areas, and discontinuity
areas such as object boundaries. The runtime of each algorithm (ca = cooperative
algorithm, gc = graph cuts) is also listed.



image = Map (runtime: ca = 82 s / gc = 337 s)

                  ALL      NON OCCL   OCCL      TEXTRD   TEXTRLS   D_DISCNT
ca  Error         4.08     1.12       16.08     1.13     0.47      3.69
ca  Bad Pixels    5.91%    0.53%      90.76%    0.52%    0.95%     5.15%
gc  Error         3.91     1.07       15.45     1.07     0.38      3.65
gc  Bad Pixels    5.63%    0.36%      88.76%    0.36%    0.00%     4.52%


image = Sawtooth (runtime: ca = 288 s / gc = 673 s)

                  ALL      NON OCCL   OCCL      TEXTRD   TEXTRLS   D_DISCNT
ca  Error         1.40     0.68       7.31      0.71     0.42      1.62
ca  Bad Pixels    4.41%    1.86%      92.39%    1.95%    0.99%     6.56%
gc  Error         1.49     0.70       7.88      0.73     0.40      1.60
gc  Bad Pixels    3.99%    1.38%      94.02%    1.49%    0.31%     6.39%


image = Tsukuba (runtime: ca = 174 s / gc = 476 s)

                  ALL      NON OCCL   OCCL      TEXTRD   TEXTRLS   D_DISCNT
ca  Error         1.18     0.81       5.43      0.95     0.55      1.67
ca  Bad Pixels    4.03%    1.75%      90.21%    2.54%    0.68%     8.11%
gc  Error         1.25     0.92       5.35      1.04     0.73      2.02
gc  Bad Pixels    4.24%    2.04%      87.60%    2.77%    1.05%     10.00%


image = Venus (runtime: ca = 465 s / gc = 573 s)

                  ALL      NON OCCL   OCCL      TEXTRD   TEXTRLS   D_DISCNT
ca  Error         1.48     1.02       7.92      0.88     1.25      1.42
ca  Bad Pixels    4.40%    2.77%      91.40%    2.38%    3.57%     9.68%
gc  Error         1.47     0.95       8.33      0.81     1.18      1.31
gc  Bad Pixels    3.58%    1.93%      91.55%    1.56%    2.68%     6.84%
4 Conclusions

A formal description of a cooperative optimization algorithm has been presented. 
It is based on a system of multi-agents working together with a novel coopera- 
tion scheme to optimize the global objective function of the system. A number of 
important computational properties of the algorithm have also been presented 
in this paper. To demonstrate its power, a case study of stereo matching from 
computer vision has been provided. Using a common evaluation framework pro- 
vided by Middlebury College, the system has shown a performance comparable 
with the graph cut algorithm. 
Abstract. Velocity distributions are an enhanced representation of image
velocity, carrying more velocity information than velocity vectors.
Velocity distributions allow the representation of ambiguous motion
information caused by the aperture problem or multiple motions at a given
image region. Starting from a contrast- and brightness-invariant generative
model for image formation, a likelihood measure for local image
velocities is proposed. These local velocities are combined in a
coarse-to-fine strategy using a pyramidal image velocity representation. On each
pyramid level, the strategy calculates predictions for image formation 
and combines velocity distributions over scales to get a hierarchically 
arranged motion information with different resolution levels in velocity 
space. The strategy helps to overcome ambiguous motion information 
present at fine scales by integrating information from coarser scales. In 
addition, it is able to combine motion information over scales to get 
velocity estimates with high resolution. 



1 Introduction 

Traditionally, motion estimates in an image sequence are represented using vec- 
tor fields consisting of velocity vectors each describing the motion at a particular 
image region or pixel. Yet in most cases single velocity vectors at each image 
location are a very impoverished representation, which may introduce great er- 
rors in subsequent motion estimations. This may, e.g., be because the motion 
measurement process is ambiguous and disturbed by noise. The main problems 
which cause these errors are the aperture problem, the lack of contrast within 
image regions, occlusions at motion boundaries and multiple motions at local 
image regions caused by large image regions or transparent motion. 

To circumvent these problems, the velocity of an image patch at each location 
is understood as a statistical signal. This implies working with probabilities for 
the existence of image features like pixel gray values and velocities. The expecta- 
tion is that probability density functions are finally able to tackle the addressed 
Fig. 1. Comparison of multiscale image motion representation: a) velocity vectors
$\mathbf{v}_1, \mathbf{v}_2$ as standard representation and b) velocity
distributions $p_1, p_2$ as enhanced representation. The single column with
arrows in b) represents a velocity distribution at one location; it is shown in
c) as the corresponding 2-dimensional graph. In d) and e) the velocity
decomposition principle is shown schematically. In d), we assume that the true
velocity $\mathbf{v}$ is decomposed into velocity vectors (e.g. $\mathbf{v}_1$
and $\mathbf{v}_2$ in the figure) at different scales. In e), we apply the
analogous procedure to velocity distributions: We combine $p_1(\mathbf{v}_1)$
and $p_2(\mathbf{v}_2)$ in such a way that we get a total $p(\mathbf{v})$ with
$\mathbf{v} = \mathbf{v}_1 + \mathbf{v}_2$.



problems related to motion processing, like ambiguous motion, occlusion and 
transparency, since some specific information about them can in principle be 
extracted from the probability functions [3]. During the last ten years velocity 
distributions have been motivated by several authors [3], [4], [5] mainly using 
two approaches: the gradient based brightness change constraint equation and 
the correlation-based patch matching technique. 

A problem when dealing with velocity distributions is how to represent them in 
a multiscale pyramid. Such a pyramid is desirable e.g. for being able to repre- 
sent both high and low velocities at good resolutions with a reasonable effort. 
This is usually done in such a way that the highest velocities (connected with 
the coarsest spatial resolution) are calculated first, then a shifted [1] version of 
the image is calculated using the velocities, and afterwards the velocities of the 
next pyramid level are calculated. These then correspond to relative velocities 
because they have been calculated in a frame that is moving along with the 
velocities from the coarse resolution. But still, single velocities are used for the 
shifted version of the image, so that the information available in the distribution 
is neglected. 

In this work, we first introduce a linear generative model of image patch
formation over time. Here, the changes in two consecutive images depend on the
displacements as well as brightness and contrast variations (see Eq. 2) of
localized image patches. The results are contrast and brightness invariant
velocity distributions based on a correlation measure comparing windowed image
patches of consecutive images. Afterwards, we set up a hierarchical chain of
velocity distributions from coarse to fine spatial scale and from large to
smaller (relative) velocities. At each stage of the pyramid, the distributions
for the overall velocities are improved using the distributions from the
coarser spatial scale as a starting point and combining them with the local
measurements for the relative velocity at the given scale. This is done
exclusively on the basis of velocity distributions, and is different from other
frameworks that operate through several hierarchies but rely on velocity fields
when combining information from several hierarchy levels [1], [2]. The idea of
how to combine distributions among scales is illustrated in Fig. 1. The
presented architecture combines the advantages of a hierarchical structure and
the representation of velocities using distributions and allows for a
coarse-to-fine estimation of velocity distributions.

2 Velocity Probability Distributions 

In an image sequence$^1$, every image $\mathbf{I}^t$ at time $t$ consists of
pixels at locations $\mathbf{x}$. Each pixel is associated with properties like
its gray value $G^t_{\mathbf{x}}$ (scalar) and its velocity vector
$\mathbf{v}^t_{\mathbf{x}}$, whereas $\mathbf{G}^t$ denotes the matrix of all
gray values of image $\mathbf{I}^t$. The motion field is the set of all physical
velocities at the corresponding pixel locations at a time $t$. The optical flow
is an estimate of the real image motion field at a particular time $t$. It is
usually obtained by comparing localized patches of two consecutive images
$\mathbf{I}^t$ and $\mathbf{I}^{t+\Delta t}$ with each other. To do this, we
define $\mathbf{W} \odot \mathbf{G}^{t,\mathbf{x}}$ as the patch of gray values
taken from an image $\mathbf{I}^t$, whereas
$\mathbf{G}^{t,\mathbf{x}} := \mathcal{T}\{\mathbf{x}\}\,\mathbf{G}^t$ are all
gray values of image $\mathbf{I}^t$ shifted to $\mathbf{x}$. The shift operator
is defined as follows:
$\mathcal{T}\{\Delta\mathbf{x}\}\, G^t_{\mathbf{x}} := G^t_{\mathbf{x}-\Delta\mathbf{x}}$.
$\mathbf{W}$ defines a window (e.g. a 2-dimensional Gaussian window) that
restricts the size of the patch. One possibility to calculate an estimate for
the image velocities is now to assume that all gray values inside a patch
around $\mathbf{x}$ move with a certain common velocity
$\mathbf{v}^t_{\mathbf{x}}$ for some time $\Delta t$, resulting in a
displacement of the patch. This basically amounts to a search for
correspondences of weighted patches of gray values (displaced with respect to
each other) $\mathbf{W} \odot \mathbf{G}^{t+\Delta t,\mathbf{x}+\Delta\mathbf{x}}$
and $\mathbf{W} \odot \mathbf{G}^{t,\mathbf{x}}$ taken from the two images
$\mathbf{I}^{t+\Delta t}$ and $\mathbf{I}^t$.

To formulate the calculation of the motion estimate more precisely, we resort
to a generative model. Our ansatz is that an image $\mathbf{I}^{t+\Delta t}$ is
causally linked with its preceding image $\mathbf{I}^t$ in the following way: We
assume that an image $\mathbf{I}^t$ patch
$\mathbf{W} \odot \mathbf{G}^{t,\mathbf{x}}$ with an associated velocity
$\mathbf{v}^t_{\mathbf{x}} = \Delta\mathbf{x}/\Delta t$ is displaced by
$\Delta\mathbf{x}$ during time $\Delta t$ to reappear in image
$\mathbf{I}^{t+\Delta t}$ at location $\mathbf{x} + \Delta\mathbf{x}$, so that
for this particular patch it is

$$\mathbf{W} \odot \mathbf{G}^{t+\Delta t,\mathbf{x}+\Delta\mathbf{x}} = \mathbf{W} \odot \mathbf{G}^{t,\mathbf{x}}. \tag{1}$$

In addition, we assume that during this process the gray levels are jittered by
noise $\eta$, and that brightness and contrast variations may occur over time.
The brightness and contrast changes are accounted for by a scaling parameter
$\lambda$ and a bias $\kappa$ (both considered to be constant within a patch)
so that we arrive at

$$\left[\mathbf{W} \odot \mathbf{G}^{t+\Delta t,\mathbf{x}+\Delta\mathbf{x}}\right] = \lambda \left[\mathbf{W} \odot \mathbf{G}^{t,\mathbf{x}}\right] + \kappa\, \mathbf{W} + \eta\, \mathbf{1}. \tag{2}$$

1 Notation: We use simple font for scalars ($a$, $A$), bold for vectors and
matrices ($\mathbf{a}$, $\mathbf{A}$), and calligraphic font for functions and
operators ($\mathcal{A}$). $\mathbf{1}$ denotes a vector or matrix of ones,
$\mathbf{A} \odot \mathbf{B}$ denotes a componentwise multiplication of two
vectors or matrices and $\mathbf{A}^{\odot \alpha}$ a componentwise
exponentiation by $\alpha$ of a vector or matrix.
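Assuming that the image noise is zero-mean Gaussian with variance
$\sigma_\eta$, the likelihood that $\mathbf{G}^{t,\mathbf{x}}$ is a match for
$\mathbf{G}^{t+\Delta t,\mathbf{x}+\Delta\mathbf{x}}$, given a velocity
$\mathbf{v}^t_{\mathbf{x}}$, the window function $\mathbf{W}$ and the parameters
$\lambda$, $\kappa$ and $\sigma_\eta$, can be written down as$^2$:

$$p_{\lambda,\kappa,\sigma_\eta}\!\left(\mathbf{G}^{t+\Delta t,\mathbf{x}+\Delta\mathbf{x}} \,\middle|\, \mathbf{v}^t_{\mathbf{x}}, \mathbf{W}, \mathbf{G}^{t,\mathbf{x}}\right) \sim e^{-\frac{1}{2\sigma_\eta^2} \left\| \mathbf{W} \odot \left( \lambda\, \mathbf{G}^{t,\mathbf{x}} + \kappa\, \mathbf{1} - \mathbf{G}^{t+\Delta t,\mathbf{x}+\Delta\mathbf{x}} \right) \right\|^2}. \tag{3}$$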



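We now proceed to make the likelihood less dependent on $\lambda$ and $\kappa$,
so that a match of the patches becomes almost contrast and brightness
invariant. For this purpose, we maximize the likelihood Eq. 3 with respect to
the scaling and shift parameters. This amounts to minimizing the exponent, so
that we want to find

$$\{\lambda^*, \kappa^*\} := \operatorname{argmin}_{\lambda,\kappa} \left\| \mathbf{W} \odot \left( \lambda\, \mathbf{G}^{t,\mathbf{x}} + \kappa\, \mathbf{1} - \mathbf{G}^{t+\Delta t,\mathbf{x}+\Delta\mathbf{x}} \right) \right\|^2. \tag{4}$$

The final result of this minimization process is formulated in Eq. 7. Consider

$$\{\lambda^*, \kappa^*\} := \operatorname{argmin}_{\lambda,\kappa} \left\| \mathbf{W} \odot \left( \lambda\, \mathbf{A} + \kappa\, \mathbf{1} - \mathbf{B} \right) \right\|^2. \tag{5}$$

This leads to$^3$

$$\lambda^* = \varrho_{\mathbf{A},\mathbf{B}}\, \frac{\sigma_{\mathbf{B}}}{\sigma_{\mathbf{A}}} \quad \text{and} \quad \kappa^* = \mu_{\mathbf{B}} - \lambda^* \cdot \mu_{\mathbf{A}}. \tag{6}$$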
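
The closed-form solution (6) can be checked against a generic least-squares solver. The sketch below assumes that the weighted mean, standard deviation, and correlation coefficient use the squared window $\mathbf{W}^2$ as normalized weights, which is what the objective in (5) implies; the paper's own definition of these statistics is not reproduced in this section, and all function names and test data are illustrative.

```python
import numpy as np

def weighted_stats(W, A, B):
    """Weighted means, standard deviations and correlation, with weights W**2."""
    w2 = (W ** 2).ravel()
    w2 = w2 / w2.sum()
    a, b = A.ravel(), B.ravel()
    mu_a, mu_b = w2 @ a, w2 @ b
    var_a = w2 @ (a - mu_a) ** 2
    var_b = w2 @ (b - mu_b) ** 2
    cov = w2 @ ((a - mu_a) * (b - mu_b))
    rho = cov / np.sqrt(var_a * var_b)
    return mu_a, mu_b, np.sqrt(var_a), np.sqrt(var_b), rho

def optimal_scale_and_bias(W, A, B):
    """Closed form of Eq. (6): lambda* = rho * sigma_B / sigma_A, kappa* = mu_B - lambda* mu_A."""
    mu_a, mu_b, s_a, s_b, rho = weighted_stats(W, A, B)
    lam = rho * s_b / s_a
    kap = mu_b - lam * mu_a
    return lam, kap

# Hypothetical 7x7 patches with a Gaussian window.
rng = np.random.default_rng(1)
ax = np.arange(-3, 4)
yy, xx = np.meshgrid(ax, ax, indexing="ij")
W = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * 2.0 ** 2))
A = rng.random((7, 7))
B = 1.7 * A + 3.0 + 0.05 * rng.standard_normal((7, 7))

lam, kap = optimal_scale_and_bias(W, A, B)

# Reference: solve min || W * (lam * A + kap - B) ||^2 directly.
M = np.column_stack([(W * A).ravel(), W.ravel()])
ref, *_ = np.linalg.lstsq(M, (W * B).ravel(), rcond=None)
```

Both routes should give the same pair of parameters, close to the scale 1.7 and bias 3.0 used to generate `B`.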

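Inserting Eq. 6 into Eq. 3, so that $\lambda \to \lambda^*$ and
$\kappa \to \kappa^*$, leads to the following likelihood formulation

$$p^t(\mathbf{x}|\mathbf{v}) := p_{\lambda^*,\kappa^*,\sigma_\eta}\!\left(\mathbf{G}^{t+\Delta t,\mathbf{x}+\Delta\mathbf{x}} \,\middle|\, \mathbf{v}^t_{\mathbf{x}}, \mathbf{W}, \mathbf{G}^{t,\mathbf{x}}\right) \sim e^{-\frac{1}{2} \left( \frac{\sigma_{\mathbf{G}^{t,\mathbf{x}}}}{\sigma_\eta} \right)^2 \left( 1 - \varrho^2_{\mathbf{G}^{t,\mathbf{x}},\, \mathbf{G}^{t+\Delta t,\mathbf{x}+\Delta\mathbf{x}}} \right)}. \tag{7}$$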

The weighted empirical correlation coefficient $\varrho_{\mathbf{A},\mathbf{B}}$
is well known in statistics as an effective template matching measure. Eq. 7
shows some additional properties compared to similar likelihood measures [4],
[3], [5]. The measure
$\varrho^2_{\mathbf{G}^{t,\mathbf{x}},\, \mathbf{G}^{t+\Delta t,\mathbf{x}+\Delta\mathbf{x}}}$
ensures that the match is less affected by local changes in contrast and
brightness. Local changes in illumination due to the movement of an object
under a fixed light source, or due to movement of the light source itself,
reduce the accuracy of the likelihood measurement less.
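
To see the contrast and brightness invariance at work, the likelihood of Eq. 7 can be evaluated over a grid of integer displacements. This is a sketch under several assumptions that are not from the paper: a synthetic random texture, a second frame produced by a cyclic shift plus a gain of 1.3 and an offset of 0.2, weights given by the squared window, and an arbitrary noise level `sigma_eta = 0.1`.

```python
import numpy as np

def gaussian_window(radius, sigma):
    ax = np.arange(-radius, radius + 1)
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    return np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))

def weighted_corr(W, A, B):
    """Weighted correlation of A and B (weights W**2) and weighted std of A."""
    w2 = W ** 2 / np.sum(W ** 2)
    mu_a, mu_b = np.sum(w2 * A), np.sum(w2 * B)
    var_a = np.sum(w2 * (A - mu_a) ** 2)
    var_b = np.sum(w2 * (B - mu_b) ** 2)
    cov = np.sum(w2 * (A - mu_a) * (B - mu_b))
    return cov / np.sqrt(var_a * var_b), np.sqrt(var_a)

def velocity_likelihood(I1, I2, y, x, radius, max_disp, sigma_eta, W):
    """Normalised likelihood over integer displacements (dy, dx), following Eq. 7."""
    A = I1[y - radius:y + radius + 1, x - radius:x + radius + 1]
    size = 2 * max_disp + 1
    like = np.zeros((size, size))
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            B = I2[y + dy - radius:y + dy + radius + 1,
                   x + dx - radius:x + dx + radius + 1]
            rho, s_a = weighted_corr(W, A, B)
            like[dy + max_disp, dx + max_disp] = np.exp(
                -0.5 * (s_a / sigma_eta) ** 2 * (1.0 - rho ** 2))
    return like / like.sum()

# Synthetic pair: the second frame is shifted by (dy, dx) = (1, -2),
# with a contrast change (gain 1.3) and a brightness change (offset 0.2).
rng = np.random.default_rng(0)
I1 = rng.random((40, 40))
I2 = 1.3 * np.roll(I1, shift=(1, -2), axis=(0, 1)) + 0.2

W = gaussian_window(radius=4, sigma=2.0)
like = velocity_likelihood(I1, I2, y=20, x=20, radius=4, max_disp=3,
                           sigma_eta=0.1, W=W)
peak = np.unravel_index(like.argmax(), like.shape)  # index (dy + 3, dx + 3)
```

Despite the gain and offset, the correlation at the true displacement equals one, so the distribution peaks there. A larger `sigma_eta` flattens the distribution.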

Another property of Eq. 7 is given by the ratio of the variance of the patch at
location $\mathbf{x}$ to the variance of the Gaussian distributed noise,
$\sigma_{\mathbf{G}^{t,\mathbf{x}}}/\sigma_\eta$. The higher this ratio, the
smaller the overall variance
$\sigma = \sigma_\eta/\sigma_{\mathbf{G}^{t,\mathbf{x}}}$. That means, if
$\sigma_{\mathbf{G}^{t,\mathbf{x}}}$ is high, then $\sigma$ is low, mainly the
good patch matches contribute to the distribution, and it will be clearly
peaked. When there is a patch with low variance, the distribution will be
broader. For a higher/lower noise level $\sigma_\eta$, more/less high-contrast
patches are needed to get a significantly peaked distribution, so that for low
$\sigma_{\mathbf{G}^{t,\mathbf{x}}}$ poorly matching results also contribute
more to the likelihood distribution. Therefore $\sigma_\eta$ can act as a
parameter to control the influence of the variance
$\sigma_{\mathbf{G}^{t,\mathbf{x}}}$ of the patch on the confidence of the
distribution.

2 The symbol $\sim$ indicates that a proportionality factor normalizing the sum
over all distribution elements to 1 has to be considered.
3 Coarse-to-Fine Strategy

Now we regard a coarse-to-fine hierarchy of velocity detectors. A single level of
the hierarchy is determined by (i) the resolution of the images that are compared,
(ii) the range of velocities that are scanned and (iii) the window W of the patches
that are compared. Coarser spatial resolutions correlate with higher velocities
and larger patch windows. The strategy proceeds from coarse to fine; i.e., first
the larger velocities are calculated, then smaller relative velocities, then even
smaller ones, etc.

For a single level of resolution, we use the local velocity estimation 

$$\rho^t(\mathbf{v}|\mathbf{x}) \sim \rho^t(\mathbf{x}|\mathbf{v})\, \rho(\mathbf{v}) \tag{8}$$

with a common prior velocity distribution $\rho(\mathbf{v})$ for all positions x. The prior
$\rho(\mathbf{v})$ may be used to indicate a preference for certain velocities, e.g. peaked around zero.
In the resolution pyramid, at each level k we have a different velocity estimation
$\rho^t_k(\mathbf{v}|\mathbf{x})$ for the same physical velocity v at its corresponding physical location x.
Velocity estimations at higher levels of the pyramid (i.e., using lower spatial
resolutions) are calculated using larger windows W, and therefore tend to suffer
less from aperture problems but more from estimation errors. In
contrast, velocity estimations at lower levels of the pyramid (higher resolutions)
tend to be more accurate but also more prone to aperture problems.
Nevertheless, the estimations at the different levels of the pyramid are not
independent of each other. The goal of the pyramid is therefore to couple the different
levels in order to (i) gain a coarse-to-fine description of velocity estimations, (ii)
take advantage of more global estimations to reduce the aperture problem and
(iii) use the more local estimations to gain a highly resolved velocity signal. The
goal is to be able to simultaneously estimate high velocities yet retain fine
velocity discrimination abilities.
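
The single-level estimate of Eq. 8 amounts to multiplying the likelihood by the prior and renormalising. The following sketch shows this for one image position and a discrete set of candidate velocities; the Gaussian shape of the zero-peaked prior and all names are assumptions made for illustration only.

```python
import numpy as np

def zero_peaked_prior(velocities, sigma):
    """Hypothetical prior favouring small velocities (Gaussian shape assumed)."""
    v = np.asarray(velocities, dtype=float)
    prior = np.exp(-0.5 * (v / sigma) ** 2)
    return prior / prior.sum()

def posterior(likelihood, prior):
    """Eq. 8 at one position: posterior proportional to likelihood times prior."""
    post = np.asarray(likelihood, dtype=float) * np.asarray(prior, dtype=float)
    return post / post.sum()                 # normalise over all candidate velocities
```

For a likelihood of (0.2, 0.6, 0.2) and a prior of (0.25, 0.5, 0.25), the products are (0.05, 0.3, 0.05), which normalise to (0.125, 0.75, 0.125).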

In order to achieve this, we do the following: The highest level of the pyramid
estimates global velocities of the image. These velocities are used to impose a
moving reference frame for the next lower pyramid level to estimate better
resolved, more local velocities. That is, we decompose the velocity distributions in
a coarse-to-fine manner, estimating at each level the relative velocity
distributions needed for an accurate total velocity distribution estimation.

The advantages of such a procedure are manifold. If we want to get good es- 
timates for both large and highly resolved velocities/distributions without a 
pyramidal structure, we would have to perform calculations for each possible 
velocity, which is computationally prohibitive. In a pyramidal structure, we get 
increasingly refined estimations for the velocities starting from inexpensive, but 
coarse initial approximations and refining further at every level. 

At each level of the pyramid, we do the following steps: 

1. Start with inputs

$$G_k^{t+\Delta t}, \quad \tilde{G}_k^{t} . \tag{9}$$

Here $\tilde{G}_k^{t}$ is the level k prediction of all gray values of image $G_k^{t+\Delta t}$ using the
information available at t. E.g., for the highest level with k = 0, $\tilde{G}_0^{t} = G_0^{t}$,
since there are no further assumptions about velocities v (i.e. v = 0).

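2. Calculate the local likelihood for the k-th level velocity

$$\tilde{\rho}^t_k(\mathbf{x}|\mathbf{v}_k) := \rho_{\lambda^*,\kappa^*,\sigma_\eta}\!\left(G_k^{t+\Delta t,\mathbf{x}+\Delta\mathbf{x}} \,\middle|\, \mathbf{v}_k, W, \tilde{G}_k^{t,\mathbf{x}}\right) \tag{10}$$

as formulated in Eq. 7. Note that at the highest level, $\mathbf{v}_0$ is equal to the
physical velocity v from $\rho^t_k(\mathbf{x}|\mathbf{v})$, whereas at lower levels, $\mathbf{v}_k$ is a differential
velocity related with the likelihood estimation $\tilde{\rho}^t_k(\mathbf{x}|\mathbf{v}_k)$. Note also that $\tilde{\rho}^t_k$
correlates $\tilde{G}_k^{t,\mathbf{x}}$ (and not $G_k^{t,\mathbf{x}}$) with $G_k^{t+\Delta t,\mathbf{x}+\Delta\mathbf{x}}$.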

3. Calculate the local likelihood for the physical velocity v by combining the
estimation for the physical velocity from the higher stage k - 1 with the
likelihood estimations from stage k,

$$\rho^t_k(\mathbf{x}|\mathbf{v} = \mathbf{v}_{k-1} + \mathbf{v}_k) := \sum_{\mathbf{v}_k,\mathbf{v}_{k-1}} \tilde{\rho}^t_k(\mathbf{x} + \mathbf{v}_{k-1}\Delta t \,|\, \mathbf{v}_k)\; \rho^t_{k-1}(\mathbf{v}_{k-1}|\mathbf{x}) . \tag{11}$$

At the highest level there will be no combination because no velocity
distributions from a coarser level are available and therefore $\rho^t_0(\mathbf{x}|\mathbf{v}) := \tilde{\rho}^t_0(\mathbf{x}|\mathbf{v})$.

4. Combine the likelihood with the prior $\rho_k(\mathbf{v})$ to get the local a-posteriori
probability for the physical velocity v according to

$$\rho^t_k(\mathbf{v}|\mathbf{x}) \sim \rho^t_k(\mathbf{x}|\mathbf{v})\, \rho_k(\mathbf{v}) . \tag{12}$$



5. Use the gained a-posteriori probability for the prediction of the image at
time t + Δt at the next level k + 1 according to

$$\tilde{G}^{t}_{k+1} := \sum_{\mathbf{v},\mathbf{x}} \rho^t_k(\mathbf{v}|\mathbf{x})\; W^{\mathbf{x}-\mathbf{v}\Delta t} \odot G^{t}_{k+1} . \tag{13}$$

This is the best estimate according to level k and the constraints given by
the generative model. $W^{\mathbf{x}-\mathbf{v}\Delta t}$ is the window shifted by x - vΔt and takes
into account the correct window weightings.

6. Increase the pyramid level k and return to point 1. 
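
The combination in step 3 (Eq. 11) can be sketched for a single position and one-dimensional integer velocities as follows. This is only an illustration under simplifying assumptions: the array layout, the function name and the use of a dictionary keyed by the total velocity are hypothetical choices, and the relative likelihoods are assumed to have been evaluated already at the positions shifted by each coarse velocity.

```python
import numpy as np

def combine_levels(coarse_post, rel_lik, coarse_vels, rel_vels):
    """Combine coarse and relative velocity estimates at one position (cf. Eq. 11).

    coarse_post[i]  posterior of the coarse velocity coarse_vels[i] at x
    rel_lik[i, j]   likelihood of relative velocity rel_vels[j], evaluated at
                    the position shifted by coarse_vels[i] * dt
    Returns a dict mapping each total velocity v = v_coarse + v_rel to its
    (unnormalised) likelihood.
    """
    combined = {}
    for i, v_prev in enumerate(coarse_vels):
        for j, v_rel in enumerate(rel_vels):
            v = v_prev + v_rel                       # total physical velocity
            combined[v] = combined.get(v, 0.0) + rel_lik[i, j] * coarse_post[i]
    return combined
```

If the coarse level is certain about velocity 0, the result simply reproduces the relative likelihoods around 0, while all other total velocities receive zero weight.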



4 Results 

The results of the hierarchical procedure in Fig. 2 show that a combination of
velocity distributions is possible within a velocity resolution pyramid and that
the process combines advantages of the different levels of resolution. The coarser
levels of the pyramid analyze larger patches and provide estimations for larger
velocities. Nevertheless, the estimations are often inaccurate, dependent on the
shape of the velocity distributions. In contrast, the finer levels of the pyramid
operate more locally and analyze smaller velocities. This leads in some cases to
peaked velocity distributions, but in other cases (e.g., when there is not sufficient
structure) to broad distributions because of unavailable motion information. The
combination of coarse and fine levels using the velocity distribution
representation allows more global velocity estimations to be incorporated if local
information is missing, and global velocity estimations to be refined if local
information is present. An advantage of a pyramidal structure for velocity
computation is that we gain the coarse estimations very fast, and can then refine
the results step by step. The strategy is comparable to detecting global motions
first, and then using this information in a moving coordinate frame in order to
detect the finer relative motions still available within this frame.

Fig. 2. The results of the hierarchical velocity distribution calculations using the
resolution pyramid. The right column shows flow fields (extracted for evaluation purposes).
The left three columns show the original images at time t (first column) and t + Δt
(third column), as well as the reconstructed image $\tilde{G}^t_k$ (second column), which is the
prediction of the image at time t + Δt (third column) using the available velocity
distributions at time t. At each resolution level, a number of representative 2D velocity
distributions $\rho^t_k(\mathbf{v}|\mathbf{x})$ (absolute velocities v) and $\rho^t_k(\mathbf{v}_k|\mathbf{x})$ (relative velocities $\mathbf{v}_k$) are shown,
with gray arrows indicating their positions in the image. For the distributions, white
lines indicate the velocity coordinate system, with black/white representing low/high
probabilities for the corresponding velocity. Black arrows indicate the order of the
computations within the pyramid. The coarse-to-fine calculation allows the system to use
the coarse velocity information for regions which at higher levels of resolution have flat
distributions or ambiguous motion signals, and to refine the coarse information with
additional information from the finer levels of velocity resolution. This can be seen in
the flow field at the highest level of resolution ("Flow field 2"). In "Flow field 2" only
the velocity vectors with high probabilities are shown.

Fig. 3. For comparison, here we show the velocity distributions and the flow field
calculated over the entire velocity range using fine velocity discretization and a fine
resolution window. In the flow field we see that the velocity signals are only unambiguous
at the corners of the squares, whereas the central parts convey no motion information and
the edges suffer from the classical aperture problem. Using the squared magnitude of the
difference between the correct and estimated flow, the pyramidal approach produces
54.4% less error than this one.
Abstract. Self-calibration for imaging sensors is essential to many computer
vision applications. In this paper, a new stratified self-calibration method is 
proposed for a stereo rig undergoing planar motion with varying intrinsic 
parameters. We show that the plane at infinity in a projective frame can be 
identified by (i) a constraint developed from the properties of planar motion for 
a stereo rig and (ii) a zero-skew assumption of the camera. Once the plane at 
infinity is identified, the calibration matrices of the cameras and the upgrade to 
a metric reconstruction can be readily obtained. The proposed method is more 
flexible than most existing self-calibration methods in that it allows all intrinsic 
parameters to vary. Experimental results for both synthetic data and real images 
are provided to show the performance of the proposed method. 



1 Introduction 

Self-calibration methods for imaging sensors have been an active research subject in 
recent years. Based on the rigidity of the scene and the constancy of the internal 
camera parameters, many results have been obtained [1-6, 11-15]. Many existing self-
calibration methods try to solve for the intrinsic parameters immediately after the
projective reconstruction, which has the drawback of having to determine many 
parameters simultaneously from nonlinear equations. This prompts the development 
of stratified approaches for calibrating cameras [12, 13] and stereo rigs [2, 3, 4, 5]. 

'Stratified' means converting a projective calibration first to an affine calibration
and then to the Euclidean calibration. Along this line, M. Pollefeys and L. Van Gool
proposed a stratified self-calibration method [13]. The method uses the modulus
constraint to identify the plane at infinity in the projective frame so as to upgrade the
projective reconstruction to an affine and then a Euclidean reconstruction. But the method
does not uniquely identify the plane at infinity. R.I. Hartley proposed a method to
search for the plane at infinity in a limited 3-dimensional cubic region of the parameter
space [12]. Although the method can uniquely identify the plane at infinity, the
precision of the solution is constrained by the quantization of the search space. A.
Ruf et al. proposed a stratified self-calibration method for a stereo rig [3] which
identifies the plane at infinity much more effectively through decomposing the
projective translation. The drawback of the method is that the motion of the stereo rig
is restricted to pure translation. A. Zisserman et al. proposed a calibration method for
a stereo rig undergoing general motion [4]. Subsequently, using different projective
reconstructions that are associated with each stereo pair, R. Horaud and G. Csurka [5]
proposed a stratified self-calibration method for a stereo rig in general motion.
Although these stratified self-calibration methods for stereo rigs can determine the
plane at infinity more effectively, they all assume that the cameras of the stereo rig
have constant intrinsic parameters and the stereo correspondences are known.

In this paper we shall present a new stratified self-calibration method for a stereo 
rig that has varying intrinsic parameters while undergoing planar motion. To identify 
the plane at infinity, our method does not require any information of ideal points or 
points in the 3D scene. The only assumption used is that the cameras are of zero 
skew. Simulations on synthetic data and experimental results using real image 
sequences show that the proposed method deals with the cases of varying intrinsic 
parameters effectively and is robust to noise. 



2 A New Stratified Self-Calibration Method 

Consider an uncalibrated binocular stereo rig undergoing planar motion, with varying 
intrinsic parameters. It is assumed the stereo rig is set up such that the base line of the 
two cameras of the stereo rig is not parallel to the motion plane. For the purpose of 
reference, the two cameras will be referred to as upper and the lower cameras 
depending on their physical position. Suppose sequences of images are captured by 
both cameras of the stereo pair. By matching image points across all the images, we 
can obtain a projective reconstruction using the method described in [7]. It is 
assumed that such a projective reconstruction has been performed and this will be 
taken as the starting point in our self-calibration method. 

First, we select one image, denoted $I_0$, from the sequence of the lower camera as the
reference image. The projection matrix for the reference image $I_0$ can be taken as
$P_0 = [I \,|\, 0]$. Let the other images taken by the stereo rig be $I_i$ ($i = 1, \ldots, n$), with
projection matrices given by $P_i = [M_i \,|\, m_i]$ ($i = 1, \ldots, n$).

Suppose the plane at infinity $\pi_\infty$ is given in the projective frame by $[v^T \; 1]^T$,
where $v = [v_1 \; v_2 \; v_3]^T$. Then, the infinite homography between images $I_i$ and $I_0$ can be
written as:



$$H_{\infty i0} = M_i - m_i v^T \tag{1}$$

To upgrade the projective reconstruction to an affine one, the plane at infinity $\pi_\infty$ in
the projective frame must be identified. This is the most difficult step in the stratified
self-calibration procedure. We will next propose a method for identifying $\pi_\infty$ using
the properties of planar motion.

Figure 1 shows the geometry of a camera in planar motion, where all the camera
centers $O_1, O_2, \ldots, O_n$ lie on a plane called the camera motion plane, denoted $\Pi$. Let $e_{i1}$
($i = 2, \ldots, n$) be the epipole of image $I_i$ on $I_1$. The intersection lines of the plane $\Pi$ and
the image planes are the vanishing lines of the camera motion plane in the images.

Obviously, the epipoles $e_{i1}$ ($i = 2, \ldots, n$) of all the other images on $I_1$ lie on the
vanishing line of the camera motion plane on $I_1$. So the vanishing line can be obtained
by fitting a straight line to the epipoles. A similar argument can be applied to obtain
the vanishing line of the plane $\Pi$ on any image $I_i$ of the sequence.
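
Fitting the vanishing line to the epipoles can be done in several ways; the text does not say which one is used. A minimal sketch, assuming an algebraic least-squares fit of a homogeneous line via the singular value decomposition (the function name is hypothetical):

```python
import numpy as np

def fit_vanishing_line(epipoles):
    """Homogeneous line l minimising sum (l^T e)^2 over the given epipoles.

    epipoles: (n, 2) inhomogeneous or (n, 3) homogeneous image points.
    """
    e = np.asarray(epipoles, dtype=float)
    if e.shape[1] == 2:                              # append homogeneous coordinate
        e = np.hstack([e, np.ones((len(e), 1))])
    _, _, vt = np.linalg.svd(e)
    line = vt[-1]                                    # right singular vector of the
    return line / np.linalg.norm(line)               # smallest singular value
```

With noisy epipoles that lie far from the image centre, normalising the point coordinates before the fit would usually be advisable; this is omitted here for brevity.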




Fig. 1. Geometry of a camera making planar motion



Since the motion of the stereo rig is rigid, the motion planes of the two cameras are 
parallel to each other and have the same vanishing line in each image. 

Select one image, denoted $I_1$, from the sequence of the upper camera. Using the
above method, the vanishing lines of the (upper or lower) camera motion plane in the
images $I_0$ and $I_1$ can be identified, which will be denoted $l_0$ and $l_1$ respectively. Under
the rule of transforming lines under infinite homography, we have

$$l_0 \propto H_{\infty 10}^{T}\, l_1 = (M_1^T - v\, m_1^T)\, l_1 \tag{2}$$

where the symbol $\propto$ means equal up to an unknown scaling factor. To eliminate the
unknown scalar, we write the equation (2) in the form of a cross product:

$$l_0 \times (H_{\infty 10}^{T}\, l_1) = [l_0]_\times (M_1^T - v\, m_1^T)\, l_1 = 0 . \tag{3}$$



where $[l_0]_\times$ is the 3 by 3 skew-symmetric matrix associated with the cross product. Since
there are only two independent entries in a 3-vector which represents a line in
homogeneous coordinates, equation (3) provides two independent equations for the
three unknowns in $v = [v_1, v_2, v_3]^T$. Under the assumption that the upper and the lower
cameras do not lie on the same motion plane, we can make use of (3) to solve for $v_1$ and
$v_2$ in terms of $v_3$. Substituting $v_1$ and $v_2$ back into (1), we get the infinite
homographies between images $I_i$ and $I_0$ with one unknown variable $v_3$.
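
One way to carry out this elimination is sketched below. Equation (3) states that the transformed line of the upper image is parallel to the line in the reference image, so the plane vector is fixed up to one scalar, which is then determined by the chosen third component. The function name is hypothetical, and the sketch assumes that the third entry of the reference line and the product of the translation column with the upper line are both non-zero.

```python
import numpy as np

def plane_vector_from_v3(l0, l1, M1, m1, v3):
    """Solve Eq. (3) for v = (v1, v2, v3) given the third component v3.

    Eq. (3) says (M1^T - v m1^T) l1 = b - s v is parallel to l0, where
    b = M1^T l1 and s = m1^T l1, i.e. b - s v = c l0 for some scalar c.
    """
    s = m1 @ l1                      # assumed non-zero (cameras on different planes)
    b = M1.T @ l1
    c = (b[2] - s * v3) / l0[2]      # third row fixes c; assumes l0[2] != 0
    return (b - c * l0) / s
```

The returned vector has the prescribed third component by construction, and its first two components are affine functions of that component, which is why the homographies of (1) end up depending on a single unknown.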

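To determine $v_3$, we need another constraint on the infinite homographies. Let the
image of the absolute conic (IAC) on $I_0$ and $I_i$ be $\omega_0$ and $\omega_i$, respectively. Under the
rule for transforming conics by a homography, $\omega_0$ and $\omega_i$ satisfy:

$$\omega_i \propto H_{\infty i0}^{-T}\, \omega_0\, H_{\infty i0}^{-1} \tag{4}$$

The IAC $\omega_i$ on image $I_i$ is related to the calibration matrix

$$K_i = \begin{bmatrix} \alpha_{xi} & s_i & x_{0i} \\ 0 & \alpha_{yi} & y_{0i} \\ 0 & 0 & 1 \end{bmatrix}$$

of its camera as:

$$\omega_i = K_i^{-T} K_i^{-1} = \frac{1}{\alpha_{xi}^2 \alpha_{yi}^2} \begin{bmatrix} \alpha_{yi}^2 & -\alpha_{yi} s_i & -\alpha_{yi}(\alpha_{yi} x_{0i} - s_i y_{0i}) \\ -\alpha_{yi} s_i & \alpha_{xi}^2 + s_i^2 & -\alpha_{xi}^2 y_{0i} + s_i(\alpha_{yi} x_{0i} - s_i y_{0i}) \\ -\alpha_{yi}(\alpha_{yi} x_{0i} - s_i y_{0i}) & -\alpha_{xi}^2 y_{0i} + s_i(\alpha_{yi} x_{0i} - s_i y_{0i}) & \alpha_{xi}^2 \alpha_{yi}^2 + \alpha_{xi}^2 y_{0i}^2 + (\alpha_{yi} x_{0i} - s_i y_{0i})^2 \end{bmatrix} \tag{5}$$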



It is reasonable to assume that all cameras have zero skew, that is $s_i = 0$. In this case,
we have $(\omega_i)_{12} = 0$. Imposing this constraint on (4), we get one linear equation in the
entries of $\omega_0$:

$$(H_{\infty i0}^{-T}\, \omega_0\, H_{\infty i0}^{-1})_{12} = 0 . \tag{6}$$

where $(X)_{ij}$ represents the $(i,j)$th element of the matrix X. With five image pairs ($I_0$, $I_i$)
($i = 1, 2, \ldots, 5$), five such equations can be obtained, which can be written as:

$$A W_0 = 0 . \tag{7}$$

where $W_0$ is a 5-vector made up of the distinct entries of the symmetric matrix $\omega_0$,
and A is a 5x5 square matrix parameterized by the variable $v_3$. Since the entries of
$W_0$ cannot be all zero, the square matrix A must be singular, that is, A satisfies
det(A)=0. This is a polynomial equation in $v_3$ which in general admits 8 solutions. By
checking the feasibility of the solutions (i.e. $v_3$ must be real and the IAC must be a
positive definite matrix), the variable $v_3$ can be uniquely determined.
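
For a given numerical value of the unknown plane parameter, each infinite homography contributes one row of the matrix in (7). The sketch below builds such a row and recovers the conic of the reference image as the null vector of the stacked rows. It rests on assumptions that are not stated in the text: the 5-vector is ordered as the (1,1), (1,3), (2,2), (2,3) and (3,3) entries, the (1,2) entry of the reference conic is taken to be zero because the reference camera also has zero skew, and both function names are hypothetical.

```python
import numpy as np

def zero_skew_row(H):
    """Coefficients of (H^-T w0 H^-1)_12 with respect to
    W0 = (w11, w13, w22, w23, w33), assuming (w0)_12 = 0."""
    G = np.linalg.inv(H)
    g1, g2 = G[:, 0], G[:, 1]        # (G^T w0 G)_12 = g1^T w0 g2
    return np.array([
        g1[0] * g2[0],
        g1[0] * g2[2] + g1[2] * g2[0],
        g1[1] * g2[1],
        g1[1] * g2[2] + g1[2] * g2[1],
        g1[2] * g2[2],
    ])

def iac_from_rows(A):
    """Null vector of A (smallest singular value) arranged as a symmetric conic."""
    _, s, vt = np.linalg.svd(A)
    w11, w13, w22, w23, w33 = vt[-1]
    omega = np.array([[w11, 0.0, w13],
                      [0.0, w22, w23],
                      [w13, w23, w33]])
    if omega[0, 0] < 0:              # fix the arbitrary sign of the null vector
        omega = -omega
    return omega, s[-1]
```

The least singular value returned alongside the conic is the quantity that a search over candidate plane parameters would compare between candidates.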

In practice, in the presence of noise, det(A) may deviate from zero. A more robust
solution for $v_3$ can be obtained by the following algorithm:

With n image pairs ($I_0$, $I_i$) ($i = 1, \ldots, n$), where n > 5, an n x 5 matrix A is formed as in (7).

1. Select 5 rows from the matrix A randomly to get a 5x5 square matrix A'. Solving
det(A')=0, we get eight solutions of $v_3$. By substituting $v_3$ back into the matrix A'
and checking whether the IAC is positive definite, we can determine a unique
solution for $v_3$.

2. Substituting the value of $v_3$ back into the matrix A, calculate the least singular
value of the matrix A. Save the value of $v_3$ and the least singular value.

3. Repeat Steps 1 and 2 a number of (say, ten) times.

4. Compare all the least singular values obtained in Step 3 and choose as the true
value of $v_3$ the one for which the least singular value of A is smallest.
After solving for $v_3$ and substituting it back into (7) and (1), $\omega_0$ and all the infinite
homographies between images $I_i$ and $I_0$ can be solved. With (4) the IACs $\omega_i$
($i = 1, \ldots, n$) can be determined. By a Cholesky factorization of $\omega_i$, we can solve for
the calibration matrices $K_i$ for each image.
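
The last step can be sketched in a few lines. Since the inverse transpose of an upper-triangular calibration matrix is lower triangular with a positive diagonal, it coincides with the Cholesky factor of the conic up to scale. The function name is hypothetical, and the input is assumed to be positive definite.

```python
import numpy as np

def calibration_from_iac(omega):
    """Recover the upper-triangular K from omega = K^-T K^-1 (known up to scale)."""
    L = np.linalg.cholesky(omega)    # omega = L L^T with L lower triangular
    K = np.linalg.inv(L.T)           # L^T equals K^-1 up to a positive scale
    return K / K[2, 2]               # normalise so that K[2, 2] = 1
```

An arbitrary positive scale on the conic only rescales the Cholesky factor, so it is removed by the final normalisation.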



3 Experimental Results 

3.1 Simulations with Synthetic Data 

Simulations are carried out on image sequences of a synthetic scene. The scene
consists of 50 points uniformly distributed within the unit sphere centered at the
origin of the world coordinate system. The synthetic stereo rig moves on the x-y plane
along a circular path of 3 meter radius. The directions of the two cameras of the stereo
rig are different. The intrinsic parameters are different for different views and are
determined as follows: $f_x$ and $f_y$ are chosen randomly from the ranges [980, 1200] and
[960, 1000] respectively, $x_0$ and $y_0$ are chosen randomly from the ranges [498, 502] and
[398, 402] respectively, and the skew is zero. The image size is 1300x1000 pixels.




Fig. 2. Relative RMS errors on estimated camera intrinsic parameters for the image $I_0$ with
different noise levels: (a) relative RMS error of $f_x$ [pixel], (b) relative RMS error of $f_y$ [pixel],
(c) relative RMS error of $x_0$, (d) relative RMS error of $y_0$



In the simulation, N images are captured by each of the upper and lower cameras.
Gaussian white noise with standard deviations of 0.1, 0.5, 1 and 1.5 pixels is added
to the synthetic image points. For each noise level, the self-calibration method is run
10 times and the relative RMS errors on the estimated camera intrinsic parameters for
the reference image $I_0$ are calculated. The simulation results of the method for N = 10
and 20 are shown in Figure 2.



3.2 Real Sequence 1

The first experiment is designed to evaluate the proposed method when each of the 
two cameras of the stereo rig has fixed intrinsic parameters. In the experiment, the 
intrinsic parameters of the upper and lower cameras are different, but the parameters 
of each camera are kept constant during motion. Seven images are taken by each of 
the upper and lower cameras, two of which are shown in Fig. 3. The resolution of the
images is 3000x2000 pixels. 





Fig. 3. Pair of images used in experiment 1

Table 1. Calibration results in experiment 1

                f_x (pixel)   f_y (pixel)   x_0 (pixel)   y_0 (pixel)
upper camera    10243         10245         1598          815
                10370         10373         1546          814
                10311         10313         1583          815
lower camera    9925          9937          1577          968
                9876          9888          1578          964
                9972          9985          1561          966



In the experiment, the vanishing lines for the two images $I_0$ and $I_1$ are determined using
all 7 images of the lower and the upper sequence respectively. Then, all the 14 images
are used to form a 13x5 matrix A.

Three images are selected from each of the lower and upper sequences and the 
proposed method is used to find the intrinsic parameters for these images. The 
experimental results are shown in Table 1, where the first three rows are for the three
images taken by the upper camera, and the last three rows are for the three images 
taken by the lower one. Since the intrinsic parameters calculated for different images 
taken by the same camera are nearly constant, the calibration results are quite 
reasonable. 
3.3 Real Sequence 2

Experiment 2 is designed to evaluate the proposed method when the cameras of the 
stereo rig have varying intrinsic parameters. Ten images are taken by each camera of 
the stereo rig, three pairs of which are shown in Fig. 4. The CCD cameras of the stereo
rig are zoomed during motion so that the focal length is different for different views. 
The resolution of the images is 3000x2000 pixels. 




Fig. 4. Three pairs of images taken by zooming camera used in experiment 2 



Table 2. Calibration results of zooming cameras

                f_x (pixel)   f_y (pixel)   x_0 (pixel)   y_0 (pixel)
upper camera    9251          9260          1500          731
                12084         12104         1242          686
                10404         10397         1263          828
lower camera    8178          8192          1555          938
                10115         10135         1486          957
                10158         10156         1579          836



Experimental results for the six images shown in Fig. 4 are listed in Table 2. By
comparing Fig. 4 and Table 2, we see that the estimated focal lengths are
consistent with the zooming positions of the cameras.



4 Conclusions 

This paper describes a new method for self-calibration of a stereo rig in planar motion 
with varying intrinsic parameters. In previous research, calibration models used are 
either too restrictive (constant parameters) or not general enough (e.g. only the focal 
length can be varied). In practice, changes in focal length are usually accompanied by
changes of the principal point. The method presented in this paper is more flexible
than existing results in that it allows all intrinsic parameters to vary. 
We have shown in this paper how to identify the infinite plane in a projective frame
from rigid planar motion of a stereo rig. The projective calibration is upgraded to 
metric calibration under the assumption that the cameras are of zero skew. The 
simulation results show that the method is robust to the influence of noise. The two 
experiments with real images provide further justification for the self-calibration 
method. 
Abstract. This work is concerned with real-time feature tracking for
long video sequences. In order to achieve efficient and robust tracking, 
we propose two interrelated enhancements to the well-known Shi-Tomasi- 
Kanade tracker. Our first contribution is the integration of a linear illumi- 
nation compensation method into the inverse compositional approach for 
affine motion estimation. The resulting algorithm combines the strengths 
of both components and achieves strong robustness and high efficiency 
at the same time. Our second enhancement copes with the feature drift 
problem, which is of special concern in long video sequences. Refining 
the initial frame-to-frame estimate of the feature position, our approach 
relies on the ability to robustly estimate the affine motion of every fea- 
ture in every frame in real-time. We demonstrate the performance of our 
enhancements with experiments on real video sequences. 



1 Introduction 

Feature tracking provides essential input data for a wide range of computer vision 
algorithms, including most structure-from-motion algorithms [1]. Other important
applications that depend on successful feature tracking are, for example,
camera self-calibration [2] and pose estimation for augmented reality [3].

The well-known Shi-Tomasi-Kanade tracker has a long history of evolutionary 
development. Its basic tracking principle was first proposed by Lucas and Kanade 
in [4]. For tracking a feature from one frame to the next, the sum of squared 
differences of the feature intensities is iteratively minimized with a gradient 
descent method. The important aspect of automatic feature detection was added 
by Tomasi and Kanade in [5]. 

Shi and Tomasi introduced feature monitoring for detecting occlusions and 
false correspondences [6]. They measure the feature dissimilarity between the 
first and the current frame, after estimating an affine transformation to correct 
distortions. If the dissimilarity exceeds a fixed threshold, the feature is discarded. 
This method was further refined in [7], where the X84 rejection rule is used to 
automatically determine a suitable threshold. 

* This work was partially funded by the European Commission's 5th IST Programme
under grant IST-2001-34401 (project VAMPIRE). Only the authors are responsible 
for the content. 



C.E. Rasmussen et al. (Eds.): DAGM 2004, LNCS 3175, pp. 326-333, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004




Efficient Feature Tracking for Long Video Sequences 327 



Baker and Matthews propose a comprehensive framework for template alignment
using gradient descent [8], as employed by the Shi-Tomasi-Kanade tracker.
In contrast to the algorithm of Lucas and Kanade, they suggest estimating the 
inverse motion parameters and updating them with incremental warps. Their 
inverse compositional approach facilitates the precomputing of essential opera- 
tions, considerably increasing the speed of the algorithm. 

As its motion estimation is completely intensity-based, the feature tracker is 
very sensitive to illumination changes. Jin et al. developed a method for simul- 
taneous estimation of affine motion and linear illumination compensation [9]. 
Our first contribution is the combination of Jin’s method with Baker’s inverse 
compositional approach. We evaluate our new algorithm by comparing it with 
the intensity distribution normalization approach suggested in [7]. 

Due to small parameter estimation errors, features tracked from frame to 
frame will slowly drift away from their correct position. We propose to solve the 
feature drift problem by incorporating the results of the affine motion estimation. 
Another solution with respect to the tracking of larger templates is put forward 
by Matthews et al. in [10]. 

After a short overview of our tracking system in the next section, we present 
the combined motion estimation and illumination compensation algorithm in 
Sect. 3. Our approach for solving the feature drift problem is detailed in Sect. 4. 
Finally, we demonstrate experimental results in Sect. 5. 



2 Tracking System Overview 

Our goal of real-time feature tracking for long video sequences not only led to
the enhancement of key components of the Shi-Tomasi-Kanade tracker, but also
required a careful arrangement of the remaining components. In this section, we
will briefly explain these additional design considerations.

We employ the feature detector derived in [5]. It was designed to find opti- 
mal features for the translation estimation algorithm of the tracker. Tomasi and 
Kanade also discovered that detected corners are often positioned at the edge of
the feature window [5]. As this phenomenon can lead to suboptimal tracking
performance, we use smaller windows for feature detection than for feature tracking.
Consequently, even if a corner lies at the edge of the detection window, it is well 
inside the actual tracking window. Another possibility is to emphasize the inner 
pixels of the detection window by applying Gaussian weights. Unfortunately, this 
method did not further improve the tracking in our experiments. 

Fig. 1. In the top row, five instances of a feature that was tracked with the proposed
algorithm are shown. The respective reconstructions in the bottom row illustrate the
performance of the affine motion estimation and the linear illumination compensation.

When feature tracking is performed on long video sequences, losing features
is inevitable. As we want to keep the number of features approximately
constant, lost features have to be replaced regularly. In order to retain the desired
real-time performance, we devised a hierarchical algorithm which successively
selects the best features according to the ranking provided by the interest
image. After the selection of one feature, only a local update of the algorithm's
data structure is required. Additionally, the algorithm is able to enforce a
minimum distance between each new feature and all other features, which prevents
wasting computational resources on tracking highly overlapping features.
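
The hierarchical selection algorithm itself is not specified in the text. As a plain stand-in that only illustrates the ranking by interest value and the minimum-distance rule, a simple greedy variant could look as follows; the function name and its parameters are hypothetical, and the global sort used here is much less efficient than the local updates described above.

```python
import numpy as np

def select_features(interest, n_features, min_dist, existing=()):
    """Greedily pick the strongest pixels of the interest image that keep at
    least min_dist (in pixels) from all existing and newly selected features."""
    order = np.argsort(interest, axis=None)[::-1]        # strongest response first
    ys, xs = np.unravel_index(order, interest.shape)
    taken = [tuple(p) for p in existing]
    selected = []
    for y, x in zip(ys, xs):
        if len(selected) == n_features:
            break
        y, x = int(y), int(x)
        if all((y - ty) ** 2 + (x - tx) ** 2 >= min_dist ** 2 for ty, tx in taken):
            taken.append((y, x))
            selected.append((y, x))
    return selected
```

Passing the positions of the features that are still being tracked as `existing` keeps replacement features away from them.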

The main task of feature tracking is to estimate the translation of a feature 
from one frame to the next. Lucas and Kanade observed that the basin of con- 
vergence of their gradient descent algorithm can be increased by suppressing 
high spatial frequencies [4]. The amount of smoothing is bounded by the size of
the feature windows, because at least some structure has to remain visible for 
a meaningful registration. In order to increase the maximum displacement that 
can be tolerated by the tracker, we employ a Gaussian image pyramid coupled 
with a coarse-to-fine strategy for translation estimation. Usually working with 
three levels of downsampled images, we can considerably extend the basin of con- 
vergence. Another important addition is the linear motion prediction, which is 
especially beneficial when a feature moves with approximately constant velocity. 

After the affine motion estimation, which is discussed in the next section, 
outliers have to be detected and rejected. Although the dynamic threshold com- 
putation in [7] is promising, we rely on a fixed threshold for the maximum SSD 
error. In our experience, the gap between correctly tracked features and out- 
liers is sufficiently large when illumination compensation is performed. Jin et 
al. discard features whose area falls below a given threshold [9]. We extend this
method by observing the singular values of the affine transformation matrix, 
which represent the scale of the feature window along the principal axes of the 
affine transformation. This way, we can also reject features that are extremely 
distorted, but have approximately retained their original area. 
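
The singular-value test can be sketched as follows. The threshold values and the function name are hypothetical placeholders, since the text does not state which limits are used.

```python
import numpy as np

def is_distorted(A, min_scale=0.5, max_scale=2.0):
    """Reject a feature if the 2x2 linear part A of its affine transformation
    stretches or shrinks the window too much along either principal axis."""
    s = np.linalg.svd(np.asarray(A, dtype=float), compute_uv=False)
    return bool(s.max() > max_scale or s.min() < min_scale)
```

A transformation such as diag(4, 0.25) has determinant one, so an area test alone would accept it, whereas its singular values 4 and 0.25 reveal the strong distortion.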

3 Efficient Feature Tracking 

After estimating the translation of a feature from one frame to the next, we 
compute its affine motion and the illumination compensation parameters with 
respect to the frame of its first appearance. By continually updating these pa- 
rameters in every frame, we are able to successfully track features undergoing 
strong distortions and intensity changes, as illustrated in Fig. 1. In addition, this 
approach allows us to discard erroneous features as early as possible. 
In order to achieve real-time performance, we adopt the inverse compositional 
approach for motion estimation proposed in [8]. The traditional error function 
is 

$$ \sum_x \bigl( f(x) - f_t(g(x, p + \Delta p)) \bigr)^2 , \qquad (1) $$



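where $f(x)$ and $f_t(x)$ denote the intensity values of the first frame and the 
current frame, respectively. In our case, the parameterized warp function $g$ is 
the affine warp 

$$ g(x, p) = \begin{pmatrix} 1 + p_1 & p_2 \\ p_3 & 1 + p_4 \end{pmatrix} x + \begin{pmatrix} p_5 \\ p_6 \end{pmatrix} , \qquad (2) $$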



where $x$ represents 2-D image coordinates and $p$ contains the six affine motion 
parameters. By swapping the role of the frames, we get the new error function 
of the inverse compositional algorithm 

$$ \sum_x \bigl( f(g(x, \Delta p)) - f_t(g(x, p)) \bigr)^2 . \qquad (3) $$



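Solving for $\Delta p$ after a first-order Taylor expansion yields 

$$ \Delta p = H^{-1} \sum_x \Bigl[ \nabla f \frac{\partial g}{\partial p} \Bigr]^T \bigl( f_t(g(x, p)) - f(x) \bigr) \qquad (4) $$

with 

$$ H = \sum_x \Bigl[ \nabla f \frac{\partial g}{\partial p} \Bigr]^T \Bigl[ \nabla f \frac{\partial g}{\partial p} \Bigr] . $$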

The increased efficiency of the inverse compositional approach is due to the fact 
that matrix $H^{-1}$ can be precomputed, as it does not depend on the current 
frame or the current motion parameters. The new rule for updating the motion 
parameters is 

$$ g(x, p_{new}) = g(g(x, \Delta p)^{-1}, p) . \qquad (5) $$
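The update rule composes the current warp with the inverse of the incremental warp. A minimal sketch using 3 x 3 homogeneous matrices follows; it assumes the parameter ordering p1..p4 for the distortion matrix and p5, p6 for the translation, and the helper names are hypothetical.

```python
import numpy as np

def to_matrix(p):
    """Homogeneous 3x3 matrix of the affine warp g(x, p)."""
    p1, p2, p3, p4, p5, p6 = p
    return np.array([[1.0 + p1, p2, p5],
                     [p3, 1.0 + p4, p6],
                     [0.0, 0.0, 1.0]])

def from_matrix(M):
    """Inverse of to_matrix: read the six parameters back from M."""
    return np.array([M[0, 0] - 1.0, M[0, 1], M[1, 0],
                     M[1, 1] - 1.0, M[0, 2], M[1, 2]])

def update_parameters(p, delta_p):
    """Return p_new such that g(x, p_new) = g(g(x, delta_p)^-1, p)."""
    M_new = to_matrix(p) @ np.linalg.inv(to_matrix(delta_p))
    return from_matrix(M_new)
```

Because the incremental warp is inverted rather than added, the update stays a plain matrix product, and a zero increment leaves the parameters unchanged.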

We combine the efficient inverse compositional approach with the illumi- 
nation compensation algorithm presented in [9], in order to cope with intensity 
changes, which are common in video sequences of real scenes. They can be caused 
by automatic exposure correction of the camera, changing illumination condi- 
tions, and even movements of the captured objects. 

The linear model $\alpha f(x) + \beta$, where $\alpha$ adjusts contrast and $\beta$ adjusts 
brightness, has proven to be sufficient for our application (compare Fig. 1). With this 
illumination compensation model, our cost function becomes 

$$ \sum_x \bigl( \alpha f(g(x, \Delta p)) + \beta - f_t(g(x, p)) \bigr)^2 . \qquad (6) $$



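Computing the first-order Taylor expansion around the identity warp $g(x, 0)$ 
gives us 

$$ \sum_x \Bigl( \alpha f(x) + \alpha \nabla f \frac{\partial g}{\partial p} \Delta p + \beta - f_t(g(x, p)) \Bigr)^2 . \qquad (7) $$

With the introduction of the new vectors 

$$ q = (\alpha \Delta p_1 \;\; \alpha \Delta p_2 \;\; \alpha \Delta p_3 \;\; \alpha \Delta p_4 \;\; \alpha \Delta p_5 \;\; \alpha \Delta p_6 \;\; \alpha \;\; \beta)^T , \qquad (8) $$

$$ h(x) = (x f_x(x) \;\; y f_x(x) \;\; x f_y(x) \;\; y f_y(x) \;\; f_x(x) \;\; f_y(x) \;\; f(x) \;\; 1)^T , \qquad (9) $$

we can rewrite Equation (7) as 

$$ \sum_x \bigl( h(x)^T q - f_t(g(x, p)) \bigr)^2 . \qquad (10) $$

Solving this least-squares problem finally results in 

$$ q = \Bigl( \sum_x h(x) h(x)^T \Bigr)^{-1} \sum_x h(x) f_t(g(x, p)) . \qquad (11) $$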

As can easily be seen, the 8x8 matrix composed of dyadic products of vector 
h(x) is still independent of the current frame and the current motion parameters. 
Therefore, it only has to be computed and inverted once for each feature, which 
saves a considerable amount of computation time. Additionally, the simultaneous 
estimation of motion and illumination parameters promises faster convergence. 
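The precomputation can be sketched as follows. This is an illustration under assumptions rather than the implementation evaluated in this paper: image gradients are taken with `np.gradient`, window coordinates are centered on the feature, and the class and function names are hypothetical.

```python
import numpy as np

def build_h(template):
    """Stack the vector h(x) for every pixel of the feature window."""
    f = np.asarray(template, dtype=float)
    fy, fx = np.gradient(f)            # derivatives along rows (y) and columns (x)
    rows, cols = f.shape
    y, x = np.mgrid[0:rows, 0:cols].astype(float)
    x -= (cols - 1) / 2.0              # center the coordinates on the feature
    y -= (rows - 1) / 2.0
    parts = [x * fx, y * fx, x * fy, y * fy, fx, fy, f, np.ones_like(f)]
    return np.stack([c.ravel() for c in parts], axis=1)   # shape (n_pixels, 8)

class PrecomputedFeature:
    def __init__(self, template):
        self.h = build_h(template)
        # The 8x8 matrix depends only on the first frame: invert it once.
        self.inv = np.linalg.inv(self.h.T @ self.h)

    def solve(self, warped_current):
        """Return (delta_p, alpha, beta) for the warped current window."""
        ft = np.asarray(warped_current, dtype=float).ravel()
        q = self.inv @ (self.h.T @ ft)
        alpha, beta = q[6], q[7]
        return q[:6] / alpha, alpha, beta
```

Each iteration then costs one warp of the current window, one matrix-vector product with the stacked h(x), and one 8 x 8 multiplication. If the current window differs from the template only by contrast and brightness, the solution returns those two values and a zero motion increment.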

4 Feature Drift Prevention 

There are several reasons why the feature windows in two frames will never be 
identical in video sequences of real scenes: 

— image noise, 

— geometric distortions (rotation, scaling, non-rigid deformation), 

— intensity changes (illumination changes, camera exposure correction), 

— sampling artefacts of the image sensor. 

Although these effects are usually very small in consecutive frames, it is obvious 
that frame-to-frame translation estimation can never be absolutely accurate. 
Consequently, using only translation estimation will invariably cause the feature 
window to drift from its true position when the estimation errors accumulate. 

As the feature drift problem only becomes an issue in long video sequences, 
it was not considered in early work on feature tracking [4,5]. Feature monitoring 
and outlier rejection as described in [6,7] can only detect this problem. Once the 
feature has drifted too far from its initial position, the affine motion estimation 
fails to converge and the feature is discarded. If subsequent algorithms require 
highly accurate feature positions, this shortcoming can be problematic. Jin et al. 
use affine motion estimation exclusively, thus giving up the much larger basin of 
convergence of pure translation estimation [9] . 

Fig. 2. Illumination compensation test sequence with 100 frames and 200 features. 
Left image: frame 0. Right image: frame 50 (lower right) / 99 (upper left). 

We propose to solve the feature drift problem with a two-stage approach. 
First, pure translation estimation is performed from the last frame to the current 
frame. Then, the affine motion between the first frame and the current frame 
is estimated, using the newly computed translation parameters and the four 
affine distortion parameters of the preceding frame as initialization. 
The translation parameters of the new affine motion parameters constitute the 
final feature position for the current frame. Our solution requires affine motion 
estimation, preferably with illumination compensation, in every frame. This can 
now be done in real-time thanks to the efficient algorithm put forward in Sect. 3. 
Because the coordinate system for estimating the affine motion is always centered 
on the original feature, small errors in the computation of the affine distortion 
matrix will not negatively affect the translation parameters in our approach. 
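The control flow of the two-stage approach can be sketched as below. Both estimators are hypothetical callables supplied by the caller, and the sketch assumes that the translation parameters p5, p6 are expressed directly as the feature position in image coordinates.

```python
import numpy as np

def track_frame(position, distortion, estimate_translation, estimate_affine):
    """One frame of two-stage tracking.

    position   : (2,) final feature position of the previous frame
    distortion : (4,) affine distortion parameters p1..p4 of the previous frame
    estimate_translation(position) -> (2,) position in the current frame
    estimate_affine(p_init) -> (6,) affine parameters w.r.t. the first frame
    """
    # Stage 1: frame-to-frame translation (large basin of convergence).
    translated = np.asarray(estimate_translation(position), dtype=float)
    # Stage 2: affine registration against the first frame, initialized with
    # the new translation and the previous distortion parameters.
    p_init = np.concatenate([np.asarray(distortion, dtype=float), translated])
    p = np.asarray(estimate_affine(p_init), dtype=float)
    # The translation part of the affine result is the final position.
    return p[4:6], p[:4]
```

Since the final position is taken from the registration against the first frame, a small error of the translation stage is corrected in every frame instead of being carried over to the next one.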

5 Experimental Evaluation 

All experiments in this section were performed on a personal computer with 
a Pentium IV 2.4 GHz CPU and 1 GB main memory. The video images were 
captured with a digital FireWire camera at a resolution of 640 x 480. The feature 
detector, the translation estimation, and the affine motion estimation worked 
with window sizes of 5 x 5, 7 x 7, and 13 x 13, respectively. 

We compared our new affine motion and linear illumination compensation 
algorithm of Sect. 3 with the photometric normalization approach suggested 
by Fusiello et al. [7]. They normalize the intensity distribution of the feature 
windows with respect to the mean and the standard deviation of the intensities. 
Their approach is limited to alternating estimation of motion and illumination. 

The test sequence illustrated in Fig. 2 contains 100 frames and exhibits strong 
intensity changes created by small movements of the test object. 200 features had 
to be tracked without replacing lost features. The number of successfully tracked 
features is 162 for our algorithm and 156 for the distribution normalization 
algorithm. Most of the lost features were close to the edge of the object and 
left the field of view during the sequence. As confirmed by this experiment, in 
general the robustness of both approaches is very similar. The great advantage 
of our algorithm is the lower average number of required iterations, which is 
2.21 iterations compared to 3.58 iterations for the distribution normalization 
algorithm. Consequently, with 20.9 ms against 23.9 ms overall computation time 
per frame, our tracking algorithm has a notable speed advantage. 

Fig. 3. Feature drift prevention test sequence with 220 frames and 10 features. The 
upper row shows frames 0, 60, 120, and 150 with feature drift prevention. The lower 
row shows frame 219 with (left) and without (right) feature drift prevention. 

Fig. 4. Close-up views of feature tracking with (upper row) and without (lower row) 
feature drift prevention are shown for frames 0, 80, 120, and 219 of the test sequence. 

The feature drift prevention experiments illustrated in Fig. 3 and Fig. 4 were 
performed on a test sequence with 220 frames. 10 features were chosen auto- 
matically with the standard feature detector described in Sect. 2. The standard 
approach only tracked one feature over the whole sequence, whereas the proposed 
feature drift prevention enabled the tracker to successfully track all 10 features. 
The close-up views of selected frames shown in Fig. 4 confirm the explanations 
given in Sect. 4. The small errors of the frame-to-frame translation estimation 
accumulate over time, finally preventing the affine motion estimation used for 
feature rejection from converging. On the other hand, using the translation pa- 
rameters of the affine motion estimation as final feature positions yields very 
accurate and stable results. 

6 Conclusion 

We proposed and evaluated two enhancements for efficient feature tracking in 
long video sequences. First, we integrated a linear illumination compensation 
method into the inverse compositional approach for affine motion estimation. 
The resulting algorithm proved to be robust to illumination changes and out- 
performed existing algorithms in our experiments. Furthermore, we overcame the 
feature drift problem of frame-to-frame translation tracking by determining the 
final feature position from the translation parameters of the affine motion estima- 
tion. We demonstrated the increased accuracy and robustness of this approach 
in our experiments. With the described enhancements, our tracking system can 
robustly track 250 features at a rate of 30 frames per second while replacing lost 
features every five frames on a standard personal computer. 



References 

1. Oliensis, J.: A Critique of Structure-from-Motion Algorithms. Computer Vision 
and Image Understanding 84 (2001) 407-408 

2. Koch, R., Heigl, B., Pollefeys, M., Gool, L.V., Niemann, H.: Calibration of Hand-held 
Camera Sequences for Plenoptic Modeling. In: Proceedings of the International 
Conference on Computer Vision, Corfu, Greece (1999) 585-591 

3. Ribo, M., Ganster, H., Brandner, M., Lang, P., Stock, C., Pinz, A.: Hybrid Track- 
ing for Outdoor AR Applications. IEEE Computer Graphics and Applications 
Magazine 22 (2002) 54-63 

4. Lucas, B.D., Kanade, T.: An Iterative Image Registration Technique with an Appli- 
cation to Stereo Vision. In: Proceedings of the 7th International Joint Conference 
on Artificial Intelligence. (1981) 674-679 

5. Tomasi, C., Kanade, T.: Detection and Tracking of Point Features. Technical 
Report CMU-CS-91-132, Carnegie Mellon University (1991) 

6. Shi, J., Tomasi, C.: Good Features to Track. In: Proceedings of the IEEE Confer- 
ence on Computer Vision and Pattern Recognition, Seattle, USA (1994) 593-600 

7. Fusiello, A., Trucco, E., Tommasini, T., Roberto, V.: Improving Feature Tracking 
with Robust Statistics. Pattern Analysis and Applications 2 (1999) 312-320 

8. Baker, S., Matthews, I.: Equivalence and Efficiency of Image Alignment Algo- 
rithms. In: Proceedings of the IEEE Conference on Computer Vision and Pattern 
Recognition, Kauai, USA (2001) 1090-1097 

9. Jin, H., Favaro, P., Soatto, S.: Real-Time Feature Tracking and Outlier Rejection 
with Changes in Illumination. In: Proceedings of the International Conference on 
Computer Vision, Vancouver, Canada (2001) 684-689 

10. Matthews, I., Ishikawa, T., Baker, S.: The Template Update Problem. In: Pro- 
ceedings of the British Machine Vision Conference. (2003) 




Recognition of Deictic Gestures with Context* 



Nils Hofemann, Jannik Fritsch, and Gerhard Sagerer 



Applied Computer Science 
Faculty of Technology, Bielefeld University 
33615 Bielefeld, Germany 

{nhofeman, jannik, sagerer}@techfak.uni-bielefeld.de 



Abstract. Pointing at objects is a natural form of interaction between humans 
that is of particular importance in human-machine interfaces. Our goal is the 
recognition of such deictic gestures on our mobile robot in order to enable a 
natural way of interaction. The approach proposed analyzes image data from the 
robot's camera to detect the gesturing hand. We perform deictic gesture recognition 
through extending a trajectory recognition algorithm based on particle filtering 
with symbolic information from the objects in the vicinity of the acting hand. This 
vicinity is specified by a context area. By propagating the samples depending on 
a successful matching between expected and observed objects the samples that 
lack a corresponding context object are propagated less often. The results obtained 
demonstrate the robustness of the proposed system integrating trajectory data with 
symbolic information for deictic gesture recognition. 



1 Introduction 

In various human-machine interfaces more human-like forms of interaction are devel- 
oped. Especially for robots inhabiting human environments, a multi-modal and human 
friendly interaction is necessary for the acceptance of such robots. Apart from the inten- 
sively researched areas of speech processing that are necessary for dialog interaction, the 
video-based recognition of hand gestures is a very important and challenging topic for 
enabling multi-modal human-machine interfaces that incorporate gestural expressions 
of the human. 

In every-day communication deictic gestures play an important role as it is intuitive 
and common for humans to reference objects by pointing at them. In contrast to other 
types of gestural communication, for example sign language [10], deictic gestures are 
not performed independently of the environment but stand in a context to the referenced 
object. We concentrate on pointing gestures for identifying medium sized objects in an 
office environment. Recognizing deictic gestures, therefore, means not only to classify 
the hand motion as pointing but also to determine the referenced object. Here we do not 
consider referencing object details. We will focus on the incorporation of the gesture 

* The work described in this paper was partially conducted within the EU Integrated Project 
COGNIRON ("The Cognitive Companion") funded by the European Commission Division 
FP6-IST Future and Emerging Technologies under Contract FP6-002020 and supported by the 
German Research Foundation within the Graduate Program 'Task Oriented Communication'. 
context, i.e., the referenced object, into a motion-based gesture recognition algorithm 
resulting in a more robust gesture recognition. 

According to Bobick [3], human motion can be categorized into three classes: movement, 
activity, and action. Each category represents a different level of recognition complexity: 
A movement has little variation in its different instances and is generally only 
subject to linear scalings, e.g., it is performed at different speeds. An activity is described 
by a sequence of movements but can contain more complex temporal variations. Both 
movement and activity do not refer to elements external to the human performing the 
motion. Interesting for our view on deictic gestures is the class action that is defined by 
an activity and an associated symbolic information (e.g., a referenced object). Obviously, 
a deictic gesture ’pointing at object X’ can be described with this motion schema. Here, 
the low level movements are accelerating and decelerating of the pointing hand and the 
activity is a complete approach motion. Combining this activity of the pointing hand 
with the symbolic data denoting the referenced object X results in recognizing the action 
’pointing at object X’. Due to the characteristics of pointing gestures we employ a 2D 
representation for the hand trajectory based on the velocity and the change of direction 
of the acting hand in the image. 

An important topic for deictic gesture recognition is binding the motion to a symbolic 
object: During a pointing gesture the hand approaches an object. Using the direction 
information from the moving hand, an object can be searched in an appropriate search 
region. If an object is found, a binding of the object to the hand motion can be established. 
We will show how this binding can be performed during processing of the trajectory data 
resulting in an integrated approach combining sensory trajectory data and the symbolic 
object data for recognizing deictic gestures with context. We intend to use this recognition 
system for the multi-modal human-machine interface on-board a mobile robot allowing 
humans to reference objects by speech and pointing [8]. 

In this paper we will first discuss related work on gesture recognition in Section 2. 
Subsequently, we give in Section 3 an overview of the presented system and the used 
modules. The Particle Filtering algorithm applied for activity recognition is described in 
Section 4. In Section 5 we show how this algorithm is combined with symbolic object 
data for recognition of deictic gestures. In Section 6 results of the system acquired in a 
demonstration scenario are presented. We conclude the paper with a short summary in 
Section 7. 



2 Related Work 



Although there is a large amount of literature dealing with gesture recognition, only very 
few approaches have actually attacked the problem of incorporating symbolic context 
into the recognition task. One of the first approaches exploiting hand motions and objects 
in parallel is the work of Kuniyoshi [7] on qualitative recognition of assembly actions 
in a blocks world domain. This approach features an action model capturing the hand 
motion as well as an environment model representing the object context. The two models 
are related to each other by a hierarchical parallel automaton that performs the action 
recognition. 
An approach dealing with the recognition of actions in an office environment is 
the work by Ayers and Shah [1]. Here a person is tracked based on detecting the face 
and/or neck with a simple skin color model. The way in which a person interacts with an 
object is defined in terms of intensity changes within the object’s image area. By relating 
the tracked person to detected intensity changes in its vicinity and using a finite state 
model defining possible action sequences, the action recognition is performed. Similar 
to Kuniyoshi’s approach, no explicit motion models are used. 

An approach that actually combines both types of information, sensory trajectory 
data and symbolic object data, in a structured framework is the work by Moore et al. [9]. 
Different image processing steps are carried out to obtain image-based, object-based, 
and action-based evidences for objects and actions. Moore et al. analyze the trajectory 
of a tracked hand with Hidden-Markov-Models trained offline on different activities 
related to the known objects to obtain the action-based evidence. 

Only the approach by Moore et al. incorporates the hand motion, while the ap- 
proaches by Kuniyoshi and Ayers and Shah rely only on the hand position. However, 
in the approach of Moore et al. the sensory trajectory information is used primarily 
as an additional cue for object recognition. We present in the following an approach 
for reaching the opposite goal of recognizing gestures with the help of symbolic 
information. 



3 System Overview 

Due to the requirements of a fluent conversation between a human and a machine, the 
system for recognizing deictic gestures has to work in real-time. The overall deictic 
gesture recognition system is depicted in Fig. 1. The first two modules depicted at the 
left are designed for operating directly on the image data. The module on the top extracts 
the trajectory of the acting hand from the video data by detecting skin-colored regions 
(for details see [4], chapter 4). The resulting regions are 
tracked over time using a Kalman filter with a constant acceleration model. The module 
at the bottom performs object recognition in order to extract symbolic information about 
the objects situated in the scene. This module is based on an algorithm proposed by Viola 
and Jones [11]. In this paper we focus on the action recognition module which contains 
an activity recognition algorithm that is extended to incorporate symbolic data from the 
object recognition. In this way, a recognition of deictic gestures with incorporation of 
their context is realized. The recognition results of the system can facilitate a multi-modal 
human-machine interface. 




[Figure: skin color segmentation and hand tracking provide trajectory data; object 
recognition provides symbolic object data; both are passed to the activity recognition 
inside the deictic gesture recognition module.] 

Fig. 1. Architecture of the deictic gesture recognition system. 
4 Activity Recognition 



Based on the trajectory generated by the acting hand of the human we can classify 
this trajectory. Since the start and end points of gestures are not explicitly given it 
is advantageous if the classification algorithm implicitly selects the relevant parts of a 
trajectory for classification. Additionally, as the same gestures are usually not identically 
executed the classification algorithm should be able to deal with a certain variability of 
the trajectory. The algorithm selected for segmentation and recognition of activities is 
based on the Conditional Density Propagation (Condensation) algorithm which is a 
particle filtering algorithm introduced by Isard and Blake to track objects in noisy image 
sequences [5] . In [6] they extended the procedure to automatically switch between several 
activity models to allow a classification of the activities. Black and Jepson adapted the 
Condensation algorithm in order to classify the trajectories of commands drawn at a 
blackboard [2]. 

Our approach is based on the work of Black and Jepson. Activities are represented by 
parameterized models which are matched with the input data. In contrast to the approach 
presented by Black and Jepson where motions are represented in an image coordinate 
system $(\Delta x, \Delta y)$, we have chosen a trajectory representation that consists of the velocity 
$\Delta r$ and the change of direction $\Delta\gamma$. In this way we abstract from the absolute direction 
of the gesture and can represent a wide range of deictic gestures with one generic model. 
As the user typically orients himself towards the dialog partner the used representation 
can be considered view-independent in our scenario. 
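A minimal sketch of this representation, assuming the hand positions are given as image coordinates per frame and that the first change of direction is set to zero (both conventions are assumptions of this example):

```python
import numpy as np

def trajectory_features(points):
    """Convert hand positions of shape (T, 2) into rows (delta_r, delta_gamma)."""
    pts = np.asarray(points, dtype=float)
    step = np.diff(pts, axis=0)                      # (dx, dy) per time step
    delta_r = np.hypot(step[:, 0], step[:, 1])       # velocity in pixels per frame
    heading = np.arctan2(step[:, 1], step[:, 0])     # absolute direction of motion
    turn = np.diff(heading, prepend=heading[0])      # first change of direction is 0
    delta_gamma = (turn + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi)
    return np.column_stack([delta_r, delta_gamma])
```

A straight motion to the right and the same motion performed upwards produce identical feature rows, which is the invariance to the absolute direction described above.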

Each gesture model $m^{(\mu)}$ consists of a 2-dimensional trajectory, which describes the 
motion of the hand during execution of the activity. 

$$ m^{(\mu)} = \{x_0, x_1, \ldots, x_T\}, \quad x_t = (\Delta r_t, \Delta\gamma_t) \qquad (1) $$

For comparison of a model $m^{(\mu)}$ with the observed data $z_t = (\Delta r_t, \Delta\gamma_t)$ the 
parameter vector $s_t$ is used. This vector defines the sample of the activity model $\mu$ where 
the time index $\phi$ indicates the current position within the model trajectory at time $t$. The 
parameter $\alpha$ is used for amplitude scaling while $\rho$ defines the scaling in time dimension. 

$$ s_t = (\mu_t, \phi_t, \alpha_t, \rho_t) \qquad (2) $$



The goal of the Condensation algorithm is to determine the parameter vector $s_t$ 
so that the fit of the model trajectory with the observed data $z_t$ is maximized. This is 
achieved by temporal propagation of $N$ weighted samples 

$$ (s_t^{(1)}, \pi_t^{(1)}), \ldots, (s_t^{(N)}, \pi_t^{(N)}) \qquad (3) $$

which represent the a posteriori probability $p(s_t | z_t)$ at time $t$. The weight $\pi_t^{(n)}$ of the 
sample $s_t^{(n)}$ is the normalized probability $p(z_t | s_t^{(n)})$. This is calculated by comparing each 
scaled component of the model trajectory in the last $w$ time steps with the observed data. 
For calculating the difference between model and observed data a Gaussian density is 
assumed for each point of the model trajectory. 
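The comparison over the last w time steps can be sketched as follows. The window length, the standard deviations, the rounding of the model index, and the omission of the normalization constant are assumptions of this example and are not taken from the text.

```python
import numpy as np

def sample_likelihood(model, observations, phi, alpha, rho, w=10, sigma=(1.0, 0.3)):
    """Unnormalized p(z_t | s_t) for one sample.

    model        : (T, 2) model trajectory with rows (delta_r, delta_gamma)
    observations : (t+1, 2) observed data, newest row last
    phi, alpha, rho : position in the model, amplitude scaling, time scaling
    w, sigma     : window length and per-component std (assumed values)
    """
    model = np.asarray(model, dtype=float)
    obs = np.asarray(observations, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    log_p = 0.0
    for j in range(min(w, len(obs))):
        idx = int(round(phi - rho * j))      # model index j steps in the past
        if idx < 0 or idx >= len(model):
            break
        diff = obs[-1 - j] - alpha * model[idx]
        log_p += -0.5 * np.sum((diff / sigma) ** 2)
    return float(np.exp(log_p))
```

Observations that coincide with the scaled model trajectory yield the maximum value of 1, and any mismatch in amplitude or timing lowers the value. Dividing by the sum over all samples gives the normalized weights.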

Fig. 2. The definition of the context area. 

The propagation of the weighted samples over time consists of three steps and is 
based on the results of the previous time step: 

Select: Selection of $N$ samples $s_{t-1}^{(n)}$ according to their respective weight $\pi_{t-1}^{(n)}$ from the 
sample pool at time $t - 1$. This selection scheme implies a preference for samples 
with high probability, i.e., they are selected more often. 

Predict: The parameters of each sample $s_t^{(n)}$ are predicted by adding Gaussian noise 
to $\alpha_{t-1}$ and $\rho_{t-1}$ as well as to the position $\phi_{t-1}$ that is increased in each time step 
by $\rho_t$. If $\phi_t$ is larger than the model length $\phi_{max}$ a new sample is initialized. 

Update: Determination of the weights $\pi_t^{(n)}$ based on $p(z_t | s_t^{(n)})$. 
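The three steps can be sketched as one function. This is an illustration under assumptions: samples are stored as rows (mu, phi, alpha, rho), `likelihood` is a hypothetical callable returning the unnormalized p(z_t | s_t) for one sample, the noise levels are invented for the example, and finished samples are re-initialized uniformly.

```python
import numpy as np

def condensation_step(samples, weights, likelihood, phi_max, rng,
                      noise=(0.5, 0.05, 0.05), scale_range=(0.65, 1.35)):
    """One Select / Predict / Update cycle.

    samples : (N, 4) rows (mu, phi, alpha, rho)
    weights : (N,) normalized weights of the previous time step
    phi_max : (n_models,) length of each model trajectory
    noise   : std of the Gaussian noise on phi, alpha, rho (assumed values)
    """
    phi_max = np.asarray(phi_max, dtype=float)
    n = len(samples)
    # Select: draw N samples with probability proportional to their weight.
    new = samples[rng.choice(n, size=n, p=weights)].copy()
    # Predict: diffuse alpha and rho, then advance phi by rho plus noise.
    new[:, 2] += rng.normal(0.0, noise[1], n)
    new[:, 3] += rng.normal(0.0, noise[2], n)
    new[:, 1] += new[:, 3] + rng.normal(0.0, noise[0], n)
    # Samples that ran past the end of their model are re-initialized.
    done = new[:, 1] > phi_max[new[:, 0].astype(int)]
    k = int(done.sum())
    new[done, 0] = rng.integers(0, len(phi_max), k)
    new[done, 1] = 0.0
    new[done, 2] = rng.uniform(scale_range[0], scale_range[1], k)
    new[done, 3] = rng.uniform(scale_range[0], scale_range[1], k)
    # Update: weight every sample by its likelihood and normalize.
    w = np.array([likelihood(s) for s in new], dtype=float)
    total = w.sum()
    w = w / total if total > 0 else np.full(n, 1.0 / n)
    return new, w
```

Calling the function once per frame with the weights it returned in the previous call propagates the sample set over time.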

Using the weighted samples obtained by these steps the classification of activities 
can be achieved. The probability that a certain model $\mu_i$ is completed at time $t$ is given 
by its so-called end probability $p_{end}(\mu_i)$. This end probability is the sum of all weights 
of a specific activity model with $\phi_t > 0.9 \phi_{max}$. 
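Computing the end probability amounts to a masked sum over the sample weights. A short sketch, assuming samples are stored as rows (mu, phi, alpha, rho) as in the other examples:

```python
import numpy as np

def end_probability(samples, weights, model, phi_max, fraction=0.9):
    """Sum of the weights of all samples of `model` in its final 10 percent."""
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    finished = (samples[:, 0] == model) & (samples[:, 1] > fraction * phi_max)
    return float(weights[finished].sum())
```

An activity is reported as recognized once this value exceeds a threshold chosen for the respective model.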

For the overall recognition system the repertoire of activities consists of approach 
and rest. The model rest is used to model the time periods where the hand is not moving 
at all. With these models the trajectory-based recognition of deictic gestures can be 
performed. 

5 Recognition of Pointing Actions 

As mentioned in the introduction a deictic gesture is always performed to reference an 
object more or less in the vicinity of the hand. To extract this fundamental information 
from the gesture, both the movement of the hand represented by the trajectory and 
symbolic data describing the object have to be combined. This combination is necessary 
if several objects are present in the scene as only using the distance between the hand 
and an object is not sufficient for detecting a pointing gesture. The hand may be in the 
vicinity of several objects but the object referenced by the pointing gesture depends on 
the direction of the hand motion. This area where an object can be expected in the spatial 
context of an action is called context area. 

In order to have a variable context area we extend the model vector $x_t$ (Eq. 1) by 
adding parameters for this area. It is defined as a circle segment with a search radius $c_r$ 
and a direction range, limited by a start and end angle ($c_\alpha$, $c_\beta$). These parameters are 
visualized in Fig. 2. The angles are interpreted relative to the direction of the tracked 
hand. The approach model consists of some time steps with increasing velocity but 
without a context area in the beginning; later in the model a context area is defined with 
a shrinking distance $c_r$ while the hand slows down. 

To search objects in a context area relative to the hand position the absolute position 
$(P_x, P_y)$ of the hand is required. Accordingly, the complete input data 
consists of the observed motion data $z_t$ and the coordinates $P_x, P_y$. 
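The membership test for the context area can be sketched as a point-in-circle-segment check. The function name and argument layout are hypothetical, angles are assumed to be in radians, and the start angle is assumed not to exceed the end angle.

```python
import math

def in_context_area(hand, direction, obj, c_r, c_alpha, c_beta):
    """True if `obj` lies in the circle segment attached to the hand.

    hand, obj : (x, y) image positions
    direction : current direction of the hand motion in radians
    c_r       : search radius
    c_alpha, c_beta : start and end angle relative to `direction`
    """
    dx, dy = obj[0] - hand[0], obj[1] - hand[1]
    if math.hypot(dx, dy) > c_r:
        return False
    rel = math.atan2(dy, dx) - direction
    rel = (rel + math.pi) % (2 * math.pi) - math.pi   # wrap to [-pi, pi)
    return c_alpha <= rel <= c_beta
```

Objects behind the moving hand or outside the search radius are rejected even if they are close, which is what distinguishes the directed context area from a pure distance criterion.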
The spatial context defined in the models is incorporated in the Condensation 
algorithm as follows. In each time step the trajectory and context data are sequentially 
processed for every sample. At first the values of the sample are predicted based on 
the activity of the hand, afterwards the symbolic object data in relation to the hand is 
considered: 

If there are objects in the context area of the sample at the current time index $\phi_t$ 
one object is selected randomly. For adding this symbolic data to the samples of the 
Condensation we extend the sample vector $s_t$ (Eq. 2) by a parameter $ID_t$ denoting a 
binding with a specific object: 

$$ s_t = (\mu_t, \phi_t, \alpha_t, \rho_t, ID_t) \qquad (4) $$

This binding is performed in the Update step of the Condensation algorithm. An 
object found in the context area is bound to the sample if no binding has occurred 
previously. Once the sample $s_t$ contains an object ID it will be propagated with the 
sample using $ID_t^{(n)} = ID_{t-1}^{(n)}$. 

Additionally, we extend the calculation of the sample weight with a multiplicative 
context factor $P_{symb}$ representing how well the bound object fits the expected spatial 
context of the model. 

$$ \pi_t^{*(n)} \propto p(z_t | s_t^{(n)}) \, P_{symb}(ID_t | s_t^{(n)}) \qquad (5) $$

For evaluating pointing gestures we use a constant factor for $P_{symb}$. The value of 
this factor depends on whether a previously bound object (i.e., with the correct ID) is 
present in the context area or not. We use $P_{symb} = 1.0$ if the expected object is present 
and a smaller value $P_{symb} = P_{missing}$ if the context area does not contain the previously 
bound object. This leads to smaller weights $\pi_t^{*(n)}$ of samples with a missing context so 
that these samples are selected and propagated less often. 
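The context factor enters the weight computation as a simple multiplication. In the sketch below, the treatment of samples without any binding (factor 1.0) and the default value of `p_missing` are assumptions of this example.

```python
def context_weight(likelihood, bound_id, ids_in_area, p_missing=0.2):
    """Weight before normalization: p(z_t | s_t) * P_symb(ID_t | s_t).

    likelihood  : trajectory likelihood p(z_t | s_t) of the sample
    bound_id    : object ID bound to the sample, or None (assumed convention)
    ids_in_area : collection of object IDs currently inside the context area
    """
    if bound_id is None or bound_id in ids_in_area:
        p_symb = 1.0
    else:
        p_symb = p_missing       # bound object is missing from the context area
    return likelihood * p_symb
```

Setting `p_missing` to 0 removes samples with a missing context entirely, whereas a value of 1.0 ignores the context during weighting.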

When the threshold for the end probability $p_{end}$ for one model is reached the 
parameter ID is used for evaluating the object the human pointed at. One approach is to 
count the number of samples bound with an object. But this is an inaccurate indicator 
as all samples influence the result with the same weight. Assuming a large number of 
samples is bound with one object but the weight of these samples is small this will lead 
to a misinterpretation of the bound object. A better method is to select an object bound 
to samples with a high weight, as the weight of a sample describes how well it matches 
the trajectory in the last steps. Consequently, we calculate for each object $O_j$ the sum 
$p_{O_j}$ of the weights of all samples belonging to the recognized model $\mu_i$ that were bound 
to this object. 

$$ p_{O_j}(\mu_i) = \sum_{n=1}^{N} \begin{cases} \pi_t^{*(n)} , & \text{if } \mu_i \in s_t^{(n)} \wedge (\phi_t > 0.9 \phi_{max}) \wedge ID_t = O_j \\ 0 , & \text{else} \end{cases} $$

If the highest value $p_{O_j}(\mu_i)$ for the model is larger than a defined percentage ($T_O = 
30\%$) of the model end probability $p_{end}(\mu_i)$ the object $O_j$ is selected as being the object 
that was pointed at by the 'pointing' gesture. If the model has an optional spatial context 
and for all objects the end probability $p_{O_j}(\mu_i)$ is lower than required the model is 
recognized without an object binding. 
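The weighted object selection can be sketched as follows, assuming samples are stored as rows (mu, phi, alpha, rho) with a parallel list of bound object IDs (None for unbound samples); the data layout and names are hypothetical.

```python
import numpy as np

def select_object(samples, weights, ids, model, phi_max, t_o=0.3, fraction=0.9):
    """Return the ID of the pointed-at object, or None without a binding.

    samples : (N, 4) rows (mu, phi, alpha, rho)
    weights : (N,) sample weights
    ids     : length-N sequence of bound object IDs (None if unbound)
    """
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    finished = (samples[:, 0] == model) & (samples[:, 1] > fraction * phi_max)
    p_end = weights[finished].sum()
    sums = {}
    for w, obj, done in zip(weights, ids, finished):
        if done and obj is not None:
            sums[obj] = sums.get(obj, 0.0) + w   # accumulate weight per object
    if not sums:
        return None
    best = max(sums, key=sums.get)
    return best if sums[best] > t_o * p_end else None
```

With four light samples bound to one object and a single heavy sample bound to another, counting would pick the first object, while the weighted sum picks the second.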
The benefit of the described approach is a robust recognition of deictic gestures 
combined with information about the referenced object. The system is able to detect not 
only deictic gestures performed in different directions but also provides the object the 
human pointed at. 



6 Results 

We evaluated the presented system in an experimental setup using 14 sequences of deictic 
gestures executed by five test subjects resulting in 84 pointing gestures. An observed 
person stands in front of a camera at a distance of approximately 2m so that the upper 
part of the body and the acting hand are in the field of view of the camera. The person 
points with the right hand at six objects (see Fig. 1), two on his right, three on his left 
side, and one object in front of the person. We assumed perfect object recognition results 
for the evaluation. For this evaluation only the localization of objects was needed, as 
pointing is independent of a specific object type. The images of size 320x240 pixels 
are recorded with a frame-rate of 15 images per second. In our experiments real-time 
recognition was achieved using a standard PC (Intel, 2.4GHz) running with Linux. The 
models were built by averaging over several example gestures. 

In the evaluation (see Tab. 1) we compare the results for different parameterizations 
of the gesture recognition algorithm. For evaluation we use not only the recognition 
rate but also the word error rate (WER) which is defined by $WER = (I + D + S) / E$ 
(see footnote 1). As parameters for the Condensation we use $N = 1000$ samples, the 
scaling factors $\alpha$ and $\rho$ are between 0.65 and 1.35 with variance $\sigma = 0.15$. 
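Both measures can be computed directly from the counts. The sketch assumes that the WER is the sum of insertions, deletions, and substitutions divided by the number of expected gestures, and that the recognition rate is the share of correctly recognized gestures; these readings reproduce the first two columns of Table 1 for the 84 expected gestures.

```python
def word_error_rate(insertions, deletions, substitutions, expected):
    """WER in percent: (I + D + S) / E."""
    return 100.0 * (insertions + deletions + substitutions) / expected

def recognition_rate(correct, expected):
    """Share of correctly recognized gestures in percent."""
    return 100.0 * correct / expected
```

Because insertions count as errors, the WER can be close to 100 percent even when nearly all gestures are recognized, as in the column without context.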



Table 1. Recognition of deictic gestures 

Context                none  distance  directed  weighted (last six columns) 
P_missing                 -       1.0       1.0   0.8   0.6   0.4   0.2   0.1   0.0 
Correct                  83        69        74    72    75    77    76    78    82 
Insertion                81         9         5     5     5     5     6     5    18 
Deletion                  1        10        10    12     9     7     6     6     2 
Substitution              0         5         0     0     0     0     0     0     0 
Word error rate (%)    97.6      28.6      17.8  20.2  16.7  14.3  14.3  13.3  23.8 
Recognition rate (%)   98.8      82.2      88.1  85.7  89.3  91.7  90.4  92.8  97.6 



The second column ('none') shows the results with the standard trajectory-based 
approach of Black and Jepson [2]. Without incorporation of the symbolic context no separation 
between departing and approaching activities is possible; every straight motion is 
interpreted as pointing. Therefore, this approach gives the highest recognition rate but 
it also results in the highest WER due to a huge number of insertions. Note that there is 
also no information about which object is referenced by the pointing gesture. 

1 using I: Insertion, D: Deletion, S: Substitution, E: Expected

By using the distance (column 'distance') between the approaching hand and the
surrounding objects, mainly gestures approaching an object are recognized. But a
high rate of insertions and even substitutions (i.e., wrong object bindings) is still observed.
The substitutions show the disadvantage of a simple distance criterion that does not
incorporate the direction of the hand motion.

Using a directed context area (column 'directed') we achieve a better recognition
rate and a lower WER. By introducing a weighting (columns 'weighted') for samples
not matching the expected context, the recognition rates can be further increased while
reducing the WER. If samples not matching the context are deleted ($P_{missing} = 0$), the
recognition rate increases further, but the WER increases as well. This is because
all samples with a missing context area are deleted, so that indirectly those
samples that do not match the trajectory but have a bound object are propagated.

7 Summary 

In this paper we presented an integrated approach to deictic gesture recognition that
combines sensory trajectory data with the symbolic information of objects in the vicinity
of the gesturing hand. Through the combined analysis of both types of data, our approach
achieves increased robustness while running in real time. The recognition result provides not only
the information that a deictic gesture has been performed, but also the object that has
been pointed at.

Abstract. Although mosaics are well established as a compact and non-redundant
representation of image sequences, their application still suffers
from restrictions on the camera motion or has to deal with parallax
errors. We present an approach that allows construction of mosaics from
arbitrary motion of a head-mounted camera pair. As there are no parallax
errors when creating mosaics from planar objects, our approach
first decomposes the scene into planar sub-scenes from stereo vision and
creates a mosaic for each plane individually. The power of the presented
mosaicing technique is evaluated in an office scenario, including the analysis
of the parallax error.



1 Introduction and Motivation 

Mosaicing techniques have recently been used in a variety of applications, even
though the common basis is always to represent a sequence of images of a given
scene in one image. Thus, mosaicing provides a compact, non-redundant representation
of visual information. Besides the compression benefits from avoiding
redundancy in mosaics, the larger field of view of the integrated mosaic image
serves as a better representation of the scene than the single image data,
for instance for object recognition or scene interpretation. But recent mosaicing
techniques have restrictions. The main problem in building a mosaic of a
non-planar scene is the occurrence of parallax effects as soon as the camera
moves arbitrarily. Parallax describes the relative displacement of an object as
seen from different points of view. Each plane of the scene moves at a different
relative speed with respect to the others, causing overlaps as soon as the camera
center is moved. Therefore, the construction of only a single mosaic of the scene
will not succeed. One way to deal with this problem is to control the motion
of the camera and restrict it to rotation and zooming, or to compute mosaics on
the basis of adaptive manifolds. Another possibility is to apply the mosaicing
to (approximately) planar sub-scenes, which is the central assumption of the
technique presented in this paper.

The mosaicing system provides visual information in terms of a pictorial
memory as part of a cognitive vision system (CVS) which is applied in an office
scenario [2]. This memory contains a compact visual representation of the scene.
The CVS uses two head-mounted cameras to capture the scene visually
and a stereo video display for augmented reality [13], depicted in Fig. 1.

As the stereo camera pair is located at the user's head,
there is no control over the motion of the cameras. Thus,
due to the parallax problem, it is not possible to create
just one mosaic of the whole scene, but only mosaics of almost planar
parts. This restriction appears acceptable in an office environment
since most of the objects (e.g. tables, walls, ...)
appear to have a rather planar nature. Therefore, we propose
to decompose the scene into planes and then build
a mosaic for each plane individually. The resulting set of
mosaics provides the needed compact representation of the
office scene.

This motivation leads to the two central aspects of our 
mosaicing system. First, a decomposition of the scene into 
planar sub-scenes has to be computed from stereo information, as explained in 
detail in Sec. 3.1. Second, the planes have to be tracked during the sequence 
and for each of the detected planes separate mosaics are created by registering 
them to a reference frame. How this is done is described in Sec. 3.2. Results from 
image sequences obtained in the office scenario are discussed in Sec. 4. 




2 Related Work 

A lot of research has been done on applications of mosaicing [9,6] and on improving
their performance [14,11,10]. These approaches mainly focus on the conventional
mosaicing method rather than on its restrictions, most of which are linked
to the occurrence of parallax effects. Approaches that try to free mosaicing
from these restrictions attempt either to avoid parallax or to use parallax explicitly. In order
to overcome the restrictions for mosaicing, mosaics with parallax and layers with
parallax [7] were introduced. In this case, additional information about the 3D
structure is stored to take account of parallax and to make the construction of
mosaic images more robust. Another approach [12] presents mosaicing as
a process of collecting strips to overcome most restrictions. The strip collection
copes with the effects of parallax by generating dense intermediate views, but is
still restricted to controlled translational parts in the motion of the camera.

Baker et al. [1] describe an approach to represent a scene as a collection of
planar layers calculated from depth maps. But in contrast to our algorithm, the
focus is mainly on approximating the 3D structure of the scene rather than on mosaics.



3 Mosaics of Planar Sub-scenes 

Fig. 2. System overview

Constructing mosaics from image sequences consists of computing a transformation
from the coordinates of the current image to a reference system, warping
the current image to the reference frame and integrating new pixel data into
the mosaic. The warping function can easily be computed if the images were
acquired by a camera rotating about its fixed center or if the scene is planar.
Under these restrictions, however, mosaicing is not suitable for all applications.
For the more general case where the scene is not completely planar and the
camera center is moving, a single transformation will not exist. But if the scene
is partially planar, there will be several warping functions, each of them relating
different views of corresponding planar regions. This motivates building not one
but several mosaics: one for each planar sub-scene. Given stereo image data,
mosaicing then becomes a three-step procedure:

1. Scene Decomposition: Segment the current stereo image pair into pixel 
regions depicting coplanar areas of the scene. 

2. Plane Motion Recovery: Recover motion of planar regions in order to 
calculate warping functions. 

3. Planar Mosaic Construction: Expand mosaics and integrate warped pla- 
nar regions. 

Fig. 2 gives an overview of this concept and the computational modules of the 
framework introduced here. Next, this framework shall be presented in detail. 

3.1 Scene Decomposition 

Since stereo data is available due to the design of the used AR gear, identifying 
planes in a scene is accomplished by means of the following four steps: 

1. Local Coplanar Grouping: Starting with extracted key points from a 
pair of images (e.g. by using the Harris detector [4]) and computing their 
correspondences using epipolar geometry, a plane hypothesis is represented 
by a local group of point matches forming a planar patch. 

2. Coplanar Grouping - Extension of local patch: Point matches outside 
the local patch are added to the plane if they satisfy the plane model. 

3. Constrained Plane Propagation: From a set of point matches, the plane 
is now extended to pixel regions which satisfy the plane model. The result is 
a dense match map of a plane which displays textured regions of the plane. 

4. Second plane propagation - A map of the plane: Finally, regions with
less texture are assigned to the nearest neighboring textured region. The result
is a boolean map which tells whether a pixel is part of the plane or not.
Combining this map with the current image of the scene yields a pixel
representation of the plane which is suitable for mosaicing.

1) Local Coplanar Grouping 2) Extension of local patch 3) Plane Propagation

Fig. 3. Identification of pixels belonging to a plane using matched stereo key points.



Fig. 3 illustrates this method. It shows the evolution of a single plane given a 
set of key points. In the first and second step, the plane is represented by a set 
of point matches. Black points indicate inlier points while pale points represent 
outliers. Note that the first two steps make use of a feature-based representation 
while the final steps result in an image-based representation of a plane. 

The feature-based steps of this method were introduced for the following
reason: It is known that two images of the same plane $\pi$ are related
by a 2D projective transformation (homography) $H$ and that a homography
is uniquely defined by four point matches (cf. [5]). However, after extracting
key points from stereo images, any four matched points will define a plane. An
important issue is thus to distinguish virtual planes from physical ones, which is
done as follows:

A plane hypothesis is defined as a pair $(M_i, H_i)$ where $M_i$ is a set of point
matches and $H_i$ a corresponding homography representing the plane model. The
set of all point matches is denoted as $M$. The dominant plane $\pi_{dominant}$ of a
scene is defined as the plane hypothesis which incorporates the largest number
of point correspondences, i.e.

$$\pi_{dominant} = \arg\max_{\pi_i} \|M_i\|.$$

Plane candidates $\pi_i$ are found by coplanar grouping of point matches using
RANSAC [3]. By choosing the currently dominant plane hypothesis $\pi_{dominant}$
and removing its point matches from $M$, we try to find the next dominant plane
of the scene in the same way until no new planes can be found or the maximum number
of planes is reached. The result is a rough scene decomposition represented by a
set of plane hypotheses.
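The iterative extraction of dominant planes can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the homography is normalized so that its last entry equals 1, samples are drawn from all remaining matches (the local planarity constraint is left out), and the names `fit_homography`, `dominant_plane`, `decompose`, as well as the tolerance, iteration count and minimum support, are hypothetical choices.

```python
import random


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination; return None if A is singular."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-9:
            return None
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_homography(matches):
    """Homography (assumed normalization h33 = 1) from four matches ((x, y), (u, v))."""
    A, b = [], []
    for (x, y), (u, v) in matches:
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve(A, b)
    return None if h is None else h + [1.0]


def transfer_error(h, match):
    """Distance between the mapped first point and the matched second point."""
    (x, y), (u, v) = match
    w = h[6] * x + h[7] * y + h[8]
    if abs(w) < 1e-12:
        return float("inf")
    px = (h[0] * x + h[1] * y + h[2]) / w
    py = (h[3] * x + h[4] * y + h[5]) / w
    return ((px - u) ** 2 + (py - v) ** 2) ** 0.5


def dominant_plane(matches, rng, iters=200, tol=1.0):
    """RANSAC: return the largest set of matches consistent with one homography."""
    best = []
    for _ in range(iters):
        h = fit_homography(rng.sample(matches, 4))
        if h is None:
            continue
        inliers = [m for m in matches if transfer_error(h, m) < tol]
        if len(inliers) > len(best):
            best = inliers
    return best


def decompose(matches, max_planes=3, min_support=8, seed=0):
    """Extract dominant planes one after another, removing their matches from M."""
    rng = random.Random(seed)
    remaining, planes = list(matches), []
    while len(planes) < max_planes and len(remaining) >= 4:
        inliers = dominant_plane(remaining, rng)
        if len(inliers) < min_support:
            break
        planes.append(inliers)
        remaining = [m for m in remaining if m not in inliers]
    return planes


# Synthetic demo: 30 matches on one plane and 20 on another (pure translations).
_rng = random.Random(1)
plane_a = [((x, y), (x + 5.0, y)) for x, y in
           [(_rng.uniform(0, 100), _rng.uniform(0, 100)) for _ in range(30)]]
plane_b = [((x, y), (x, y + 9.0)) for x, y in
           [(_rng.uniform(0, 100), _rng.uniform(0, 100)) for _ in range(20)]]
planes = decompose(plane_a + plane_b)
print([len(p) for p in planes])
```

The plane with the larger support is returned first, mirroring the definition of the dominant plane; the loop ends when too few matches remain or the maximum number of planes is reached.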

In order to avoid the extraction of virtual planes, we apply a local planarity
constraint. By restricting the choice of the four points to random local image
areas and fitting plane hypotheses to these patches, it is guaranteed that extracted
planes are at least locally planar. Then, local plane hypotheses are evaluated
with respect to the total number of key points. The hypothesis that agrees
with the most global matches is chosen for further processing. Fitting planes to
local patches also makes it possible to measure the planarity of planes: if the ratio of
inlier to outlier points is below a certain threshold, the hypothesis is rejected.

Since planar surfaces in a scene may contain holes, and as there might be regions
in the scene for which we do not have enough information to assign them
to a plane, we apply a pixel-based plane growing method to embed local discontinuities.
Based on the algorithm described in [8], we suggest an image-based
propagation process which densifies the plane hypotheses. This resembles classical
region growing methods for image segmentation. Instead of a homogeneity
criterion, normalized cross correlation between point matches is used for region
expansion. Starting from a set of matches with strong texture, the algorithm
densifies the matches towards regions with less texture. Expansion stops in regions
which diverge from the reference homography or have no texture. This restricts
the propagation to regions which can be approximated by the plane hypothesis.

Fig. 4. Homographies between the stereo frame sequence of a plane (panels: Left, Right)

So far, only the propagation of a single plane has been considered. Given a
set of plane hypotheses, the idea is to start a competition between these planes.
To this end, each plane hypothesis is also associated with the best correlation
score among all its point matches. Then, only the plane $\pi_i$ with the best point
match $p_{best}(a, b)$ is allowed to start a single propagation step, in which the
neighborhood $N(a, b)$ of the point match $p_{best}$ is densified. The chosen plane then provides
its next best point match and the next iteration begins. The propagation stops
if none of the planes has a point match left to be processed.
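The competition between planes amounts to a best-first propagation, which can be sketched with a single priority queue holding the candidate matches of all planes. The sketch below is a simplification under stated assumptions: matches are abstracted to pixel positions on a one-dimensional strip, `expand` is a hypothetical callback that returns the correlation scores of the neighbors under a given plane model, and `min_score` stands in for the stopping criterion in regions that diverge from the homography or lack texture.

```python
import heapq


def propagate(seeds, expand, min_score=0.5):
    """Best-first plane competition.

    seeds:  {plane: [(score, pixel), ...]} initial matches per plane hypothesis
    expand: callback (plane, pixel) -> [(score, neighbor), ...]
    Returns a dict mapping each reached pixel to the plane that claimed it.
    """
    # heapq is a min-heap, so scores are negated to pop the best match first.
    heap = [(-score, plane, pixel)
            for plane, matches in seeds.items()
            for score, pixel in matches]
    heapq.heapify(heap)
    owner = {}
    while heap:
        _, plane, pixel = heapq.heappop(heap)
        if pixel in owner:
            continue  # already claimed by a better match
        owner[pixel] = plane
        for score, neighbor in expand(plane, pixel):
            if neighbor not in owner and score >= min_score:
                heapq.heappush(heap, (-score, plane, neighbor))
    return owner


# Toy strip of 10 pixels: plane "A" correlates well on pixels 0-5, "B" on 6-9.
CORR = {"A": [0.9] * 6 + [0.2] * 4, "B": [0.2] * 6 + [0.9] * 4}


def expand(plane, pixel):
    return [(CORR[plane][q], q) for q in (pixel - 1, pixel + 1) if 0 <= q < 10]


owner = propagate({"A": [(0.9, 0)], "B": [(0.9, 9)]}, expand)
print([owner[i] for i in range(10)])
```

Using one global queue is equivalent to letting the plane with the best remaining match perform the next propagation step; the loop ends when no plane has a match left to process.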



3.2 Plane Motion Recovery and Planar Mosaic Construction 

For the construction of a planar mosaic, the homographies $H^{motion}$ between the
different frames have to be computed to recover the motion of the camera. The
motion of the feature points that were established in the decomposition
stage is also used to compute these homographies. Thus, the motion recovery
performed for each plane can be divided into two steps:

1. Tracking of plane points: Given a set of points on a plane, each point
is tracked independently. Assuming that a point is moving with constant
velocity, a linear first-order prediction of a point is used.

2. Recovering plane motion: The resulting point tracks $T_i = (p^i_{t-1}, p^i_t)$ are
supposed to lie on the same plane. For two views of a plane, there exists a
homography (see Fig. 4) which relates $p^i_{t-1}$ to $p^i_t$. Again, RANSAC
is used for a robust estimation of this homography.
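The linear first-order prediction of step 1 can be written in a few lines. The prediction itself follows from the constant-velocity assumption; how the predicted position is matched to a detected point is not specified in the text, so the nearest-candidate rule and the `max_dist` gate in the sketch below are assumptions made for illustration.

```python
def predict(prev, curr):
    """Constant-velocity prediction: next = curr + (curr - prev)."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])


def track(prev, curr, candidates, max_dist=5.0):
    """Return the candidate closest to the predicted position, or None if lost.

    The nearest-neighbor association and max_dist are illustrative assumptions.
    """
    px, py = predict(prev, curr)

    def sq_dist(c):
        return (c[0] - px) ** 2 + (c[1] - py) ** 2

    best = min(candidates, key=sq_dist, default=None)
    if best is None or sq_dist(best) > max_dist ** 2:
        return None
    return best


print(predict((10, 20), (12, 21)))
print(track((10, 20), (12, 21), [(14.5, 22.0), (30, 30)]))
```

Points for which no candidate lies near the prediction are reported as lost, which corresponds to removing points that have gone out of sight.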

Furthermore, the tracked plane has to be updated by integrating
new points and removing the ones that have gone out of sight. Therefore the homography
$H^{stereo}$ is recomputed and new points are added if they fulfill the planarity
constraint.

Based on the interframe homographies $H^{motion}$, all plane images are warped
to the reference frame of the mosaic. The integration computes the median
of the warped frames to determine the value of the resulting mosaic pixel.
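The median integration can be illustrated with a minimal sketch. It assumes that the plane images have already been warped to the reference frame and that pixels not covered by the plane are marked with `None`; the function name and the data layout (nested lists of gray values) are hypothetical.

```python
from statistics import median


def integrate(warped_frames):
    """Per-pixel median over all warped plane images.

    warped_frames: list of equally sized 2-D lists; None marks pixels that
    are not covered by the plane in that frame.
    """
    rows, cols = len(warped_frames[0]), len(warped_frames[0][0])
    mosaic = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            samples = [f[r][c] for f in warped_frames if f[r][c] is not None]
            if samples:
                mosaic[r][c] = median(samples)
    return mosaic


frames = [
    [[10, None]],
    [[12, 50]],
    [[200, 52]],  # 200 could stem from a briefly occluding object
]
print(integrate(frames))
```

The median makes the mosaic pixel insensitive to single outlying observations, whereas a mean would be pulled towards them.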
(c) Frame 20 (d) Frame 30

Fig. 5. Decomposition of the scene into two planes: each left image displays the tracked
feature points; the corresponding right image shows the textured regions of the planes.

(a) Initial frame (b) Plane image (c) Final Mosaic

Fig. 6. An example of occlusion elimination (detail view)

4 Results 

The focus of the evaluation is on the quality and the consistency of the mosaics, as
they are the final result of the presented procedure. The integration of new pixel
data into the mosaic strongly depends on the preprocessing steps, namely Scene
Decomposition and Plane Tracking. The scene decomposition in particular plays
an important role, as plane tracking is based on its results. Errors occurring in
this processing step are propagated to all following stages and result in erroneous
mosaics. Fig. 5 presents the result of the scene decomposition for a sequence
recorded in the office. The decomposition has been limited to two dominant planes to
ease the evaluation. The desk has two planes, both of which are detected correctly.
The tracked feature points are highlighted in each frame (left images) and the
propagated planes are shown in different gray shadings (right images). Note
that in frame 00 only one plane is detected, but ten frames later further points
and another plane are added and tracked from then on. Another positive effect of
integrating only image parts that belong to the same plane into the mosaics is
depicted in Fig. 6. Because the parcel in the foreground of the scene does not
belong to the same plane as the table with the journal, it is omitted from the
mosaic and the occlusion is eliminated. This makes it possible to create complete views of
partially occluded objects in the scene.

(b) Frame 95

Fig. 8. The parallax error (center) is computed as the difference between the single image
of the tracked plane (left) and the warped mosaic (right). Errors appear as white dots.



If the decomposition of the scene into planes were perfect, one would
expect no parallax error in the mosaic. But because the extracted sub-scenes
are only approximately planar, errors will occur, especially at the borders of flat
objects (e.g. a flat book lying on the table) as well as at the edges of extracted
planes. To evaluate these effects, we calculated the relative parallax error $e = \delta/s$,
which is defined as the amount of pixel differences $\delta$ between the tracked plane of the
frame and the mosaic integrated so far, normalized by the size $s$ of the plane
measured in pixels. For calculating that difference the mosaic is warped into the
current frame. In Fig. 7 the evolution of this error measure is plotted over the
whole sequence, which is partially shown in Fig. 8. As expected, the parallax
error rate increases as the mosaic grows, but even in the last frame,
frame 95, errors only occur at the edges of the objects, as can be seen in the center
image of Fig. 8(b). The computation of the mosaics (tracking and updating
the homographies) can be performed in real time once the initialization or the
update of the planes, respectively, is done.
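A possible reading of the error measure $e = \delta/s$ is sketched below. The text does not state how $\delta$ is counted; the sketch assumes that $\delta$ is the number of plane pixels whose gray value differs from the warped mosaic by more than a tolerance `tol`, which matches the description of errors as white dots. The function name, the tolerance and the nested-list image layout are illustrative assumptions.

```python
def relative_parallax_error(frame, warped_mosaic, plane_mask, tol=10):
    """e = delta / s with delta = number of differing plane pixels (assumption)
    and s = number of pixels belonging to the plane."""
    size = 0
    differing = 0
    for f_row, m_row, k_row in zip(frame, warped_mosaic, plane_mask):
        for f, m, inside in zip(f_row, m_row, k_row):
            if inside:
                size += 1
                if abs(f - m) > tol:
                    differing += 1
    return differing / size if size else 0.0


frame = [[100, 100, 100], [100, 100, 100]]
mosaic = [[100, 104, 160], [100, 100, 0]]
mask = [[1, 1, 1], [1, 0, 0]]  # only four pixels belong to the plane
print(relative_parallax_error(frame, mosaic, mask))
```

Normalizing by the plane size makes the values comparable across frames in which the visible part of the plane has different extents.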

5 Conclusion 

We presented a unique approach to create mosaics for arbitrarily moving head-mounted
cameras. The three-stage architecture first decomposes the scene into
approximated planes using stereo information, which afterwards can be tracked
and integrated into mosaics individually. This avoids the problem of parallax errors
usually occurring from arbitrary motion and provides a compact and non-redundant
representation of the scene. Furthermore, creating mosaics of the
planes makes it possible to eliminate occlusion, since objects blocking the view of a plane
are not integrated. This can, for instance, help object recognition systems and
further scene interpretation in the Cognitive Vision System of which this approach is
a part. The proposed robust decomposition and tracking algorithms allow
the system to be applied in real office scenes with common cameras.

Abstract. The Gaussian scale-space is a standard tool in image analysis.
While continuous in theory, it is generally realized with fixed regular
grids in practice. This prevents the use of algorithms which require continuous
and differentiable data and adaptive step size control, such as
numerical path following. We propose an efficient continuous approximation
of the Gaussian scale-space that removes this restriction and opens
up new ways to subpixel feature detection and scale adaptation.



1 Introduction 

Smoothing with Gaussian functions and the Gaussian scale-space have become 
standard tools in low-level image analysis. They are routinely used for prepro- 
cessing, estimation of derivatives, and feature extraction. With few exceptions, 
theories about scale-space and scale-based feature detection are derived for con- 
tinuous, differentiable functions, but are then realized on discrete grids, e.g. 
by sampling the Gaussian kernel or replacing it with a discrete approximation 
(e.g. binomial filters, Lindeberg’s discrete analog [7], or recursive filters [4]). To 
save memory and time, images are often subsampled after a certain amount of 
smoothing as in a Gaussian pyramid [2] or hybrid pyramid [8]. These approaches
always use grids whose sampling density is at most that of the original image. 
However, in [6] it was shown that a higher sampling density can be necessary in 
order to prevent information loss during image processing. Empirical evidence 
for improved feature detection on oversampled data was also reported by [10,9]. 

In this paper, we approach the sampling issue in a radical way: instead of
working on a discrete representation, we propose an abstract data type that represents
the Gaussian scale-space as a function over the reals, i.e. as a continuous,
differentiable mapping from $\mathbb{R}^2 \times \mathbb{R}^+ \to \mathbb{R}$, with given precision $\epsilon$. Algorithms can
access this data structure at arbitrary coordinates, and the requested function
values or derivatives are computed on demand. Even for very irregular access
patterns efficiency remains reasonable, as all calculations are based on splines
and thus require only simple operations in relatively small neighborhoods.

By using a continuous approach, many difficult problems may find natural
solutions. Consider, for example, edge following and linking: powerful path
following algorithms exist in the field of numerical analysis, but they require
continuously differentiable functions. Convergence statements come in the form
of asymptotic theorems $(f - \tilde f)^2 = O(h^n)$, where $\tilde f$ is the approximation of $f$
and $h$ the sampling step. Thus, to guarantee a given accuracy, one must be able
to adapt the sampling step locally. We have found indications that this may also
be true in image analysis: in continuous image reconstructions single pixels are
often intersected by more than one edge and may contain more than one critical
point. In fact, some configurations, in particular junctions, are not in general
correctly representable by any grid (Fig. 1). The same applies to bifurcations of
critical point trajectories encountered in scale selection [7] or edge focusing [1].

Fig. 1. Line junctions drawn on a grid usually occupy more than a single pixel and
have rather unpredictable shapes. This can only be prevented with a real-valued (vector)
representation.

Up to now, attempts to access images in real-valued coordinate systems have
been based on simple interpolation schemes such as linear interpolation, low
order polynomial fits, or the facet model [5,8,3]. However, these methods lead
to discontinuities of the function values or the first derivatives at pixel borders,
and algorithms requiring differentiability are not applicable. In contrast, we are
defining a reconstruction that is everywhere differentiable (up to some order) in
both the spatial and the scale directions.

C.E. Rasmussen et al. (Eds.): DAGM 2004, LNCS 3175, pp. 350-358, 2004.



2 Continuity in the Spatial Coordinates 

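For an observed discrete 1D signal $f_i$, the continuous Gaussian scale-space is
defined as a family of continuous functions $f_\sigma(x)$ obtained by:

$$f_\sigma(x) = g_\sigma \circledast f = \sum_{i=-\infty}^{\infty} g_\sigma(x - i)\, f_i \quad \text{with} \quad g_\sigma(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}} \qquad (1)$$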

Unfortunately, this expression cannot directly be used on computers because
Gaussian kernels have infinite support and must be clipped to a finite window.
No matter how large a window is chosen, a discontinuity is introduced at the
window borders, and this causes severe errors in the derivatives [12]. Reference [12]
recommends removing the discontinuity of the windowed sampled Gaussian by
interpolation with a spline. This is a special case of a more general strategy:
first compute an intermediate discrete scale-space representation by means of
some discrete prefilter, and then reconstruct a continuous scale-space from it by
means of a spline. Splines are a natural choice for this task because they are easy
to compute, achieve the highest order of differentiability for a given polynomial
order, and have small support. The prefilter will be defined so that the net result
of the prefilter/spline combination approximates the true Gaussian as closely as
possible. Ideally, we might require preservation of image structure (e.g. number
and location of extrema), but this is very difficult to formalize. Instead we
minimize the squared error between the approximation and the desired function:

$$E[\tilde f_\sigma] = \int_{-\infty}^{\infty} (f_\sigma - \tilde f_\sigma)^2\, dx = \int_{-\infty}^{\infty} \left(g_\sigma \circledast f - s_n \circledast (p_\sigma * f)\right)^2 dx \qquad (2)$$

where $\tilde f_\sigma$ is the approximation for scale $\sigma$, $p_\sigma$ the prefilter, $s_n$ an $n$th-order B-spline,
and $*$ vs. $\circledast$ distinguish discrete from continuous convolution. This minimization
problem is still intractable in the spatial domain, but due to Parseval's
theorem it can also be formulated and solved (with minor simplifications) in the
Fourier domain:

$$E[\tilde f_\sigma] = \int_{-\infty}^{\infty} (G_\sigma - S_n P_\sigma)^2 F^2\, du \qquad (3)$$

where $G_\sigma = e^{-u^2\sigma^2/2}$, $S_n = \left(\frac{\sin(u/2)}{u/2}\right)^{n+1}$, and $P_\sigma$ are the Fourier transforms
of the Gaussian, the spline, and the prefilter. The spectrum $F$ of the original
image is of course unknown. We use the common choice $F = 1$, i.e. a white noise
spectrum, where no frequency is preferred. While other possibilities exist (e.g.
natural image statistics), this doesn't significantly alter the optimal filter choice.
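The Fourier-domain residual with $F = 1$ can be evaluated numerically for any candidate prefilter. The following minimal sketch uses a plain Riemann sum; the function names, the integration range `u_max` and the number of steps are illustrative assumptions, and the prefilter is passed in as an arbitrary function of the frequency $u$.

```python
import math


def gaussian_tf(u, sigma):
    """Transfer function of the Gaussian: exp(-u^2 sigma^2 / 2)."""
    return math.exp(-0.5 * (u * sigma) ** 2)


def bspline_tf(u, n):
    """Transfer function of the n-th order B-spline: (sin(u/2) / (u/2))^(n+1)."""
    if abs(u) < 1e-12:
        return 1.0
    return (math.sin(u / 2) / (u / 2)) ** (n + 1)


def residual(prefilter, sigma, n=3, u_max=40.0, steps=40000):
    """Riemann-sum approximation of the squared error for a white spectrum F = 1."""
    du = 2 * u_max / steps
    total = 0.0
    for k in range(steps + 1):
        u = -u_max + k * du
        d = gaussian_tf(u, sigma) - bspline_tf(u, n) * prefilter(u)
        total += d * d * du
    return total


# Sanity check: with a zero prefilter the residual is the energy of the
# Gaussian transfer function, sqrt(pi) / sigma.
print(residual(lambda u: 0.0, sigma=1.0), math.sqrt(math.pi))

# A cubic B-spline without prefilter against a Gaussian of matching variance 1/3.
print(residual(lambda u: 1.0, sigma=math.sqrt(1.0 / 3.0)))
```

The first printed pair serves as a check of the numerical integration; the second value illustrates how small the residual becomes when the spline alone already resembles the target Gaussian.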

We have compared many different prefilters and report some of them below.
To realize the suggestion of [12], the prefilter $P_\sigma$ must be the combination of a
sampled windowed Gaussian and the direct spline transform [11], which ensures
that the subsequent continuous convolution with the B-spline $S_n$ (indirect spline
transform) indeed interpolates the Gaussian's sample values:

$$P^{(1)}_\sigma = \frac{\hat G_\sigma}{\hat S_n} \quad \text{with} \quad \hat S_3 = \frac{4 + 2\cos(u)}{6}, \quad \hat S_5 = \frac{66 + 52\cos(u) + 2\cos(2u)}{120} \qquad (4)$$

$\hat G_\sigma$ (the transfer function of a sampled and windowed Gaussian) can be derived
by using well-known properties of the Fourier transform: Windowing with a
box function of radius $w$ in the spatial domain corresponds to convolution with
a scaled sinc function in the Fourier domain. Spatial sampling with step size
$h = 1$ then leads to spectrum repetition at all multiples of $2\pi$. Unfortunately, the
resulting infinite sum is intractable. However, in the product $S_n P_\sigma$ the B-spline
transfer function effectively suppresses the spectrum of the prefilter for $u > 2\pi$,
so that only the first spectrum repetition at $\pm 2\pi$ needs to be considered, and
the effect of windowing can be neglected if $w > 3\sigma$. Thus,

$$\hat G_\sigma \approx e^{-(u+2\pi)^2\sigma^2/2} + e^{-u^2\sigma^2/2} + e^{-(u-2\pi)^2\sigma^2/2} \qquad (5)$$
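The quality of the three-term approximation in Eq. (5) can be checked numerically by comparing it with the transfer function of the sampled Gaussian computed directly from its samples. In the sketch below the window radius defaults to $10\sigma$ so that windowing effects are negligible; this default and the function names are assumptions made for illustration.

```python
import math


def sampled_gaussian_tf(u, sigma, radius=None):
    """Transfer function of the Gaussian sampled at integer positions,
    clipped to a window of the given radius (default: 10 sigma, an assumption)."""
    w = radius if radius is not None else int(math.ceil(10 * sigma))
    norm = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return sum(norm * math.exp(-i * i / (2 * sigma ** 2)) * math.cos(u * i)
               for i in range(-w, w + 1))


def three_term_tf(u, sigma):
    """Main spectrum plus the first repetitions at +2 pi and -2 pi, as in Eq. (5)."""
    return sum(math.exp(-0.5 * ((u + shift) * sigma) ** 2)
               for shift in (-2 * math.pi, 0.0, 2 * math.pi))


for u in (0.0, 1.0, math.pi):
    print(u, sampled_gaussian_tf(u, 1.0), three_term_tf(u, 1.0))
```

For $\sigma = 1$ the two functions agree to many digits within $[-\pi, \pi]$; passing a small `radius` to `sampled_gaussian_tf` shows the additional deviation caused by windowing.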



A simpler prefilter $P^{(2)}_\sigma$ is obtained by noticing that $1/\hat S_n$ acts as a sharpening
filter that exactly counters the smoothing effect of the indirect spline transform
$S_n$ at the sampling points. When we apply the sampled Gaussian $\hat G_{\sigma'}$ at a smaller
scale $\sigma' < \sigma$, we can drop this sharpening, i.e. $P^{(2)}_\sigma = \hat G_{\sigma'}$. Further, we replaced
$\hat G_{\sigma'}$ with approximate Gaussians: binomial filters, Deriche's recursive filters [4],
Lindeberg's discrete analogue of the Gaussian [7], and the smoothing spline filter
from [11]. Space doesn't allow us to give all transfer functions here. An even simpler
idea is to drop the prefilter altogether and stretch the B-spline instead so that its
variance matches that of the desired Gaussian: $S_{n,\sigma'}(u) = S_n(\sigma' u)$, $P^{(3)}_\sigma = 1$. All
possibilities mentioned so far perform poorly at small scales ($\sigma < 1$), so we also
tested oversampled Gaussians as prefilters, i.e. sampled Gaussians with sampling
step $h = 1/2$ whose transfer functions are (the up-arrow denotes oversampling):

$$P^{(1)}_{\sigma\uparrow}(u) = P^{(1)}_{2\sigma}(u/2) \qquad P^{(2)}_{\sigma\uparrow}(u) = P^{(2)}_{2\sigma}(u/2) \qquad (6)$$

and the B-spline transfer function must be accordingly stretched to $S_n(u/2)$.

Accurate and Efficient Approximation 353

Fig. 2. Scale normalized RMS residuals for Gaussian scale-space approximation with
3rd-order (left) and 5th-order (right) splines for various prefilters and scales.
Legend: binomial, scaled spline, discrete Gaussian, sampled Gaussian, interpolated
sampled Gaussian, oversampled Gaussian, interpolated oversampled Gaussian.

Figure 2 presents the scale normalized root mean square residuals $\sigma\sqrt{E}$ of
the minimization problem for our prefilter variants at various scales. The RMS
directly corresponds to the expected error in the spatial domain, and scale normalization
is applied in order to make residuals comparable over scales. It can
be seen that oversampled Gaussians give the best results, and interpolation (use
of $P^{(1)}$ instead of $P^{(2)}$) only improves 5th-order spline results. At scales $\sigma > \sqrt{2}$,
non-oversampled Gaussians also achieve errors below $\approx 10^{-3}$, which can be considered
good enough for practical applications (it roughly equals the quantization
noise for 256 gray levels). We also repeated this analysis with the first
and second derivatives of the Gaussian, with essentially the same results.



3 Continuity in Space with Subsampling 



So far the resolution of the intermediate images was fixed. Considering that 
neighboring sampling points become more and more redundant as scale increases, 
this is rather inefficient, especially for higher dimensional data. We now replace 
the intermediate representation with a pyramid and analyse the residuals as 
a function of the scale where subsampling is performed. Usually, subsampling 
in a pyramid scheme is done by simply dropping every other sampling point. 
However, in the context of splines we can do better: Since the function space of 
possible splines with a given sample distance is a strict superset of the function 
space at half that distance, one can define an orthogonal projection from one 
space to the other. This projection can be realized by applying a projection filter 
before dropping samples [11]. The projection filter can be derived analytically, 
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sp3, subsampling 
sp5, subsampling 

sp3, subsampling and interpolation 
sp5, subsampling and interpolation 
sp3, 3rd order projection 
sp5, 5th order projection 
sp5, 3rd order projection 



Fig. 3. Scale normalized 
RMS residuals for vari- 
ous prefilters with sub- 
sampling, as a function of 
the subsampling scale. 



and its transfer function for 3 rd -order splines is 



n 3 (u) = (7) 

12132 + 18482 cos(u) + 7904cos(2u) + 1677 cos(3 u) + 124cos(4w) + cos(5w) 
16(1208 + 1191 cos(2 u) + 120 cos(4w) + cos(6u)) 

i.e. a combination of a 5 th -order FIR and a 3 rd -order HR filter. It is important 
to note that this filter preserves the average gray value (I7 3 (0) = 1), and ful- 
fills the equal contribution condition, i.e. the even and odd samples have equal 
total weights ( II 3 (n ) = 0). If used alone, the projection approximates the ideal 
lowpass filter (the Fourier transform of the sine interpolator) but this causes se- 
vere ringing artifacts in the reduced images. This is avoided when the projection 
filter is combined with one of the smoothing prefilters P a . To derive their com- 
bined transfer functions, recall that 2— fold subsampling in space corresponds to 
a spectrum repetition at 7r in the Fourier domain. The projection filter is op- 
tionally applied before subsampling. The subsampled prefilter transfer function 
is multiplied with the transfer function of a scaled B-spline S n (2u) (below, j_fc 
means that the approximation resulted from 2 fc -fold subsampling): 

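$$P_\sigma^{\downarrow 1}(u) = P_\sigma^{\downarrow 0}(u) + P_\sigma^{\downarrow 0}(u - \pi) \qquad \text{(without projection)} \qquad (8)$$

$$P_\sigma^{\Pi\downarrow 1}(u) = \Pi_n(u)\,P_\sigma^{\downarrow 0}(u) + \Pi_n(u - \pi)\,P_\sigma^{\downarrow 0}(u - \pi) \qquad \text{(with projection)} \qquad (9)$$

$$G_\sigma^{\downarrow 1}(u) = S_n(2u)\,P_\sigma^{\downarrow 1}(u) \quad \text{or} \quad G_\sigma^{\Pi\downarrow 1}(u) = S_n(2u)\,P_\sigma^{\Pi\downarrow 1}(u) \qquad (10)$$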

For higher levels $k$ of the pyramid, this process is repeated recursively, with
spectrum repetitions at $\pi/2^{k-1}$, and splines scaled to $S_n(2^k u)$. Figure 3 depicts
the scale normalized RMS errors for a single downsampling step as a function of
the scale where the downsampling occurs, for various prefilters (with optimized
$\sigma'$ and with or without the projection filter). It can be seen that an error of 0.01
is achieved for the 3rd-order spline without projection at $\sigma \approx 2$, and an error of
0.001 for the 5th-order spline with projection at $\sigma \approx 2.4$. Instead of the rather
expensive 5th-order projection filter, 3rd-order projection has been used for 5th-order
splines as well, with only a marginal increase in error. Further analysis
showed that roughly the same accuracy levels are maintained if subsampling is
repeated in the same manner at octave intervals.
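
The two properties stated for the 3rd-order projection filter of equation (7), preservation of the average gray value and equal contribution of even and odd samples, can be checked numerically. The following is only a minimal sketch: the coefficients are copied from equation (7), while the function name `projection_filter_3` is a hypothetical choice made for this illustration.

```python
import math

# Coefficients copied from equation (7):
# numerator terms multiply cos(k*u) for k = 0..5,
# denominator terms multiply cos(2*k*u) for k = 0..3 (times the factor 16).
NUMERATOR = [12132, 18482, 7904, 1677, 124, 1]
DENOMINATOR = [1208, 1191, 120, 1]


def projection_filter_3(u):
    """Transfer function of the 3rd-order spline projection filter at frequency u."""
    num = sum(c * math.cos(k * u) for k, c in enumerate(NUMERATOR))
    den = 16 * sum(c * math.cos(2 * k * u) for k, c in enumerate(DENOMINATOR))
    return num / den


# Average gray value is preserved: the response at u = 0 should be 1.
print(projection_filter_3(0.0))
# Equal contribution condition: the response at u = pi should vanish.
print(abs(projection_filter_3(math.pi)) < 1e-12)
```

Both checks follow directly from summing the coefficients with the appropriate signs, so the sketch mainly guards against transcription errors in the coefficients.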



4 Continuity in the Scale Direction 

Fig. 4. The blending functions for scale interpolation. Curves: a[s], b[s] = 1 - a[s], c[s].

If one wants to improve feature detection by means of scale selection or coarse-to-fine
tracking, function values or derivatives at arbitrary scales rather than
at precomputed ones are often needed. If one uses simple interpolation schemes
such as rounding to the nearest scale, linear interpolation or parabola fitting,
the true Gaussian scale-space is not approximated very well, and the resulting
representation is not differentiable with respect to scale. A much better interpolation
scheme can be derived by looking at the diffusion equation whose solution
for a given initial image is precisely the Gaussian scale-space



$$\frac{\partial f}{\partial \tau} = \frac{1}{2}\,\frac{\partial^2 f}{\partial x^2}, \qquad (\tau = \sigma^2) \qquad (11)$$



According to this equation the smoothed image at some scale $\tau + \varepsilon$ can be calculated
from the image at scale $\tau$ and the corresponding second derivative by
$f_{\tau+\varepsilon}(x) \approx f_\tau(x) + \tfrac{\varepsilon}{2} f_\tau''(x)$ if $\varepsilon$ is small. This suggests that a better interpolation
scheme can be defined by a linear combination of smoothed images and
second derivatives (Laplacians in higher dimensions) at two neighboring scales.
In particular this means that a Gaussian at scale $\sigma$ can be interpolated by:



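$$g_\sigma(x) \approx a(\sigma)\,g_{\sigma_1}(x) + b(\sigma)\,g_{\sigma_2}(x) + c(\sigma)\,\frac{\partial^2}{\partial x^2} g_{\sigma_1}(x) + d(\sigma)\,\frac{\partial^2}{\partial x^2} g_{\sigma_2}(x) \qquad (12)$$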



with $\sigma_1 \le \sigma \le \sigma_2$. In order for the interpolation to preserve the average gray
value, we must require $b(\sigma) = 1 - a(\sigma)$. Since the same relationship holds in the
Fourier domain, we can again formulate a least squares minimization problem



$$E[a, c, d] = \int_0^{2\pi}\!\!\int_0^{\infty} \left(G_\sigma(u) - \tilde{G}_\sigma(u)\right)^2 u\,du\,d\varphi \qquad (13)$$



Note that we defined the residual in 2D polar coordinates because this led to a
simpler functional form than the 1D formulation and to higher accuracy in 2D.
Setting the derivatives with respect to $a$, $c$ and $d$ to zero leads to a linear system
for the interpolation coefficients. If $\sigma_2 = 2\sigma_1$, the solution to this system is

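$$x_1 = \sigma^2/\sigma_1^2, \qquad x_2 = \frac{1}{(1 + x_1)(4 + x_1)},$$

$$a = \left(62 + x_2\left(-10560 + x_2\,(32000 + 72800\,x_1)\right)\right)/54 \qquad (14)$$

$$c = \sigma_1^2 \left(15 + x_2\left(-2700 + x_2\,(6000 + 19500\,x_1)\right)\right)/54 \qquad (15)$$

$$d = \sigma_1^2 \left(240 + x_2\left(-28800 + x_2\,(96000 + 168000\,x_1)\right)\right)/54 \qquad (16)$$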



This is indeed a continuous, differentiable interpolation scheme, as the original
Gaussians are recovered at the interpolation borders, and the diffusion equation
is fulfilled there, i.e. $g_\sigma(x)|_{\sigma=\sigma_{1,2}} = g_{\sigma_{1,2}}(x)$ and $\partial_\tau g_\sigma(x)|_{\sigma=\sigma_{1,2}} = \partial_{xx} g_{\sigma_{1,2}}(x)/2$.
It is somewhat surprising that simple least squares error minimization results in
blending formulas which fulfill these requirements, because this was not enforced
during the derivation. Probably there is a (yet to be discovered) deeper reason
behind this. The accuracy of the scale interpolation scheme is very high. The
maximum scale normalized RMS error is $4.5 \times 10^{-3}$ and is reached at $\sigma = 1.398\,\sigma_1$.
If desired, the error can be reduced by an order of magnitude if $\sigma_2 = \sqrt{2}\,\sigma_1$
is chosen. Figure 4 depicts the blending functions $a$, $b$, $c$ and $d$. Derivatives
are interpolated likewise by replacing $g_{\sigma_{1,2}}$ with the derivative and using its
Laplacian. Derivative interpolation thus requires splines of at least order 5.
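
The blending functions of equations (14) to (16) are cheap to evaluate, and their behaviour at the interpolation borders can be verified directly. The sketch below assumes $\sigma_2 = 2\sigma_1$ as in the text, with $x_2 = 1/((1 + x_1)(4 + x_1))$ and a prefactor of $\sigma_1^2$ for both $c$ and $d$; the function name `blending_coefficients` is hypothetical.

```python
def blending_coefficients(sigma, sigma1):
    """Blending functions (a, b, c, d) for sigma1 <= sigma <= 2 * sigma1.

    Assumes the second precomputed scale is sigma2 = 2 * sigma1.
    """
    x1 = (sigma / sigma1) ** 2
    x2 = 1.0 / ((1.0 + x1) * (4.0 + x1))
    a = (62 + x2 * (-10560 + x2 * (32000 + 72800 * x1))) / 54
    c = sigma1 ** 2 * (15 + x2 * (-2700 + x2 * (6000 + 19500 * x1))) / 54
    d = sigma1 ** 2 * (240 + x2 * (-28800 + x2 * (96000 + 168000 * x1))) / 54
    b = 1.0 - a  # preserves the average gray value
    return a, b, c, d


# At the borders the original Gaussians should be recovered:
# (a, b, c, d) is close to (1, 0, 0, 0) at sigma1 and (0, 1, 0, 0) at 2 * sigma1.
for s in (1.0, 1.5, 2.0):
    print(s, blending_coefficients(s, 1.0))
```

An interpolated value is then the weighted sum of equation (12), using the two smoothed images and their second derivatives (Laplacians in 2D) at the neighboring scales.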



5 Results and Conclusions 

Our analysis suggests that an accurate continuous scale-space approximation 
can be obtained in two phases: First, an intermediate pyramid representation is 
computed by means of some optimized discrete filter. Second, function values 
and derivatives at arbitrary real-valued coordinates and scales are calculated on 
demand, using spline reconstruction and scale interpolation. These procedures 
can be encapsulated in an abstract data type, so that algorithms never see the 
complications behind the calculations. The scale-space starts at base scale $\sigma_{\text{base}}$,
which should be at least 0.5. The Gaussian should be windowed at $w \ge 3\sigma$.

Phase 1: Intermediate Pyramid Representation 

1. Pyramid level "-1" (scale $\sigma_{\text{base}}$): Convolve original image with oversampled Gaussian
$g_{\sigma_{-1}}$. Optionally apply the direct spline transform (interpolation prefilter).

2. Level "0" (scale $2\sigma_{\text{base}}$): Convolve original image with sampled Gaussian $g_{\sigma_0}$.
Optionally apply the direct spline transform.

3. Level "1" (scale $4\sigma_{\text{base}}$): Convolve original image with sampled Gaussian $g_{\sigma_1}$.
Optionally apply the projection filter. Drop odd samples.

4. Level "$k$" ($k > 1$): Convolve the intermediate image at level $k - 1$ with sampled
Gaussian $g_{\sigma_2}$. Optionally apply the projection filter. Drop odd samples.

The optimal values for $\sigma_{-1}, \ldots, \sigma_2$ depend on the order of the spline used, on the
value of $\sigma_{\text{base}}$ and on whether or not the interpolation/projection prefilters are
applied. Table 1 gives the values for some useful choices. They were calculated
by minimizing the scale normalized RMS error between the approximation and
the true Gaussian. It can be seen (last column) that these errors decrease for
higher order splines, larger $\sigma_{\text{base}}$ and use of interpolation/projection.

Phase 2: On-demand Calculation of Function Values or Derivatives at $(x, y, \sigma)$

1. If $\sigma = 2^{k+1}\sigma_{\text{base}}$ ($k \ge -1$): Work on level $k$ of the intermediate representation.
Calculate spline coefficients for $(\delta x, \delta y) = (x/2^k, y/2^k) - (\lfloor x/2^k \rfloor, \lfloor y/2^k \rfloor)$ and
convolve with the appropriate image window around $(\lfloor x/2^k \rfloor, \lfloor y/2^k \rfloor)$.

2. If $2^{k+1}\sigma_{\text{base}} < \sigma < 2^{k+2}\sigma_{\text{base}}$ ($k \ge -1$): Use the algorithm from Phase 2.1 to
calculate function values and corresponding Laplacians at levels $k$ and $k + 1$.
Use the scale interpolation formula to interpolate to scale $\sigma$.
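
The bookkeeping of Phase 2 (which pyramid level to use, where the image window sits, and whether scale interpolation is needed) can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the function name `locate` and its return layout are hypothetical, and the spline evaluation itself is omitted.

```python
import math


def locate(x, y, sigma, sigma_base):
    """Map a query (x, y, sigma) to a pyramid level and level-local coordinates.

    Returns (k, (ix, iy), (dx, dy), needs_interpolation), where level k has
    scale 2**(k+1) * sigma_base, (ix, iy) is the integer window position on
    that level and (dx, dy) the fractional offsets for the spline weights.
    """
    ratio = sigma / sigma_base
    if ratio < 1.0:
        raise ValueError("sigma must not be smaller than the base scale")
    # Largest k >= -1 with 2**(k+1) * sigma_base <= sigma.
    k = math.floor(math.log2(ratio)) - 1
    # Exact float comparison: adequate for a sketch, fragile in production code.
    needs_interpolation = 2.0 ** (k + 1) * sigma_base != sigma
    # Coordinates expressed in the sample grid of level k.
    xk, yk = x / 2.0 ** k, y / 2.0 ** k
    ix, iy = math.floor(xk), math.floor(yk)
    return k, (ix, iy), (xk - ix, yk - iy), needs_interpolation


print(locate(10.5, 7.25, 2.0, 0.5))  # scale coincides with a precomputed level
print(locate(10.5, 7.25, 3.0, 0.5))  # scale lies between levels k and k + 1
```

When `needs_interpolation` is true, the same lookup would be repeated on level `k + 1` and the two results blended with the scale interpolation formula.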
Table 1. Optimal scales for sampled Gaussian prefilters for various algorithm variants.
"Optional interpolation" refers to levels -1 and 0, "optional projection" (always with
3rd-order projection filter) to levels 1 and higher.

algorithm variant                                     $\sigma_{\text{base}}$   $\sigma_{-1}$   $\sigma_0$   $\sigma_1$   $\sigma_2$   max. resid.

3rd-order spline without interpolation / projection   1/2            0.4076   0.8152   1.6304   1.4121   0.018
                                                      0.6            0.5249   1.0498   2.0995   1.8183   0.0070
                                                      $\sqrt{2}/2$   0.6448   1.2896   2.5793   2.2337   0.0031

5th-order spline without interpolation / projection   1/2            0.3531   0.7062   1.4124   1.0586   0.017
                                                      0.6            0.4829   0.9658   1.9316   1.6728   0.0035
                                                      $\sqrt{2}/2$   0.6113   1.2226   2.4451   2.1175   0.0018

5th-order spline with interpolation / projection      1/2            0.4994   0.9987   1.7265   1.5771   0.0062
                                                      0.6            0.5998   1.1996   2.1790   1.9525   0.0025
                                                      $\sqrt{2}/2$   0.7070   1.4141   2.6442   2.3441   0.0009

The computation time for a single point during phase 2 is independent of the
image size. It involves only additions and multiplications (in roughly equal proportions).
If $\sigma$ coincides with one of the precalculated levels, we need 44 multiplications
per point for a 3rd-order spline and 102 for a 5th-order one. When an
intermediate scale must be interpolated, the numbers are 154 and 342 respectively.
Derivative calculations are cheaper as the polynomial order of the splines
reduces. When the data are accessed in a fixed order rather than randomly, the
effort significantly decreases because intermediate results can be reused. On a
modern machine (2.5 GHz Pentium), our implementation provides about a million
random point accesses per second for the 5th-order spline. While this is not
suitable for real time processing, it is fast enough for practical applications.

In the future, we will apply the new method to design high-quality subpixel fea- 
ture detectors. Preliminary results (which we cannot report here due to space) 
are very encouraging. We also believe that a continuous scale-space representa- 
tion will open up new roads to scale selection and scale adaptation. For example, 
variable resolution as in the human eye can be achieved by simply using a posi- 
tion dependent scale instead of an irregular (e.g. log-polar) sampling grid. 
Abstract. We describe a method for selecting optimal actions affecting
the sensors in a probabilistic state estimation framework, with an ap- 
plication in selecting optimal zoom levels for a motor-controlled camera 
in an object tracking task. The action is selected to minimize the ex- 
pected entropy of the state estimate. The contribution of this paper is 
the ability to incorporate varying costs into the action selection process 
by looking multiple steps into the future. The optimal action sequence 
then minimizes both the expected entropy and the costs it incurs. This 
method is then tested with an object tracking simulation, showing the 
benefits of multi-step versus single-step action selection in cases where 
the cameras’ zoom control motor is insufficiently fast. 



1 Introduction 

This paper describes a method for selecting optimal actions which affect the 
sensors in a probabilistic state estimation framework. The contribution of this 
paper is the ability to incorporate varying costs into the action selection process 
by looking multiple steps into the future. 

Probabilistic state estimation systems continuously estimate the current state 
of a dynamic system based on observations they receive, and maintain this esti- 
mate in the form of a probability density function. Given the possibility to affect 
the observation process with certain actions, what are the optimal actions, in an 
information theoretic sense, that the estimation system should choose to influ- 
ence the resulting probability density? 

One sample application is the selection of optimal camera actions in motor- 
operated cameras for an active object tracking task, such as pan and tilt opera- 
tions or zooming. We examine focal length selection as our sample application, 
using an extended Kalman filter for state estimation. 

* This work was partly funded by the German Research Foundation (DFG) under 
grant SFB 603/TP B2. Only the authors are responsible for the content. 
Previous work in the area of object recognition [10,4,3] has shown that an
active viewpoint selection process can reduce uncertainty. For object tracking,
active focal length selection is used to keep the target’s scale constant [6,11]. Yet
the focus of these works is not to find the optimal zoom level.

The information theoretic solution described in [5], which this work is based
on, uses the entropy of the estimated state distribution. This system calculates 
the expected entropy for each action, and then chooses the action where the 
expected entropy is lowest. 

However, this approach only works if all actions are considered equal. If the 
actions incur costs which may depend on the last action, examining the expected 
benefit of just a single action is no longer sufficient. In the example of focal length 
selection, the zoom lens motor has only a finite speed. Too high a zoom level can
cause the object to be lost when it approaches the edges of the camera image 
faster than the zoom motor can follow. 

The solution is to obtain the best sequence of future actions, and to calculate 
the costs and benefits of the sequence as a whole. In our case of a motorized 
zoom lens, the tracker is able to reduce the focal length in advance, in order for 
the low focal length to actually be available in the time frame where it is needed. 

In simulated experiments with slow zoom motors, up to 82% less object loss 
was experienced, as compared to the original single-step method. This reduced 
the overall state estimation error by up to 56%. 

The next section contains a short review of the Kalman filter and the notation 
used in this paper. Section 3 simultaneously reviews the single-step method from 
[5] and shows how to extend it to multiple steps, the main contribution of this 
paper. The method is evaluated in section 4, and section 5 concludes the paper 
and gives an outlook for future work. 

2 Review: Kalman Filter 

As in [5], we operate on the following discrete-time dynamic system: At time $t$,
the state of the system is described in the state vector $x_t \in \mathbb{R}^n$, which generates
an observation $o_t \in \mathbb{R}^m$. The state change and observation equations are



$$x_{t+1} = f(x_t, t) + w, \qquad o_t = h(x_t, a_t) + r \qquad (1)$$

where $f(\cdot,\cdot) \in \mathbb{R}^n$ is the state transition function and $h(\cdot,\cdot) \in \mathbb{R}^m$ the observation
function. $w$ and $r$ are normal zero-mean error processes with covariance
matrices $W$ and $R$.

The parameter $a_t \in \mathbb{R}^l$ is called the action at time $t$. It summarizes all the
parameters which affect the observation process. For object tracking, $a_t$ might
include the pan, tilt and the focal length of each camera. The action is performed
before the observation is made.

The task of the state estimator is to continuously calculate the distribution
$p(x_t \mid (o)_t, (a)_t)$ over the state, given the sequence $(o)_t$ of all observations and
the sequence $(a)_t$ of all actions taken up to, and including, time $t$.
Assuming the action is (for now) known and constant, the Kalman filter [8],
a standard algorithm, can be used for state estimation. Since the observation
function is based on the non-linear perspective projection model, an extended
Kalman filter [1] is necessary. A full description of the extended Kalman filter is
beyond the scope of this paper. We use the following notation for the filter: $x_t^-$
and $x_t^+$ are the a priori and a posteriori state estimate means at time $t$. $P_t^-$ and
$P_t^+$ are the covariance matrices for the a priori and a posteriori state estimates.
The extended Kalman filter performs the following steps for each time-step $t$:

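1. State mean and covariance prediction:

$$x_t^- = f(x_{t-1}^+, t-1), \qquad P_t^- = f_t^x P_{t-1}^+ f_t^{x\,T} + W. \qquad (2)$$

2. Computation of the filter gain:

$$K_t = P_t^- h_t^{x\,T}(a_t)\left(h_t^x(a_t)\, P_t^- h_t^{x\,T}(a_t) + R\right)^{-1}. \qquad (3)$$

3. State mean and covariance update by incorporating the observation:

$$x_t^+ = x_t^- + K_t\left(o_t - h(x_t^-, a_t)\right), \qquad P_t^+(a_t) = \left(I - K_t h_t^x(a_t)\right) P_t^-. \qquad (4)$$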

$f_t^x$ and $h_t^x(a_t)$ denote the Jacobians of the state transition and observation
functions. Since the observation Jacobian $h_t^x(a_t)$ depends on the selected action
$a_t$, the a posteriori state covariance does, too. In cases where no observation is
made in a time step, the a posteriori state estimate is equal to the a priori one.
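
To make the three filter steps concrete, the following sketch implements them for a scalar (one-dimensional) state, so that all matrices reduce to plain numbers. It is a simplified illustration, not the tracker used in the paper: the function name `ekf_step`, the toy model and the noise values are assumptions chosen for this example.

```python
def ekf_step(x_post, p_post, action, o, f, f_jac, h, h_jac, w_var, r_var):
    """One scalar extended Kalman filter step following equations (2) to (4).

    o is the observation, or None if no observation was made in this step.
    Returns the a posteriori mean and variance.
    """
    # (2) prediction of the state mean and variance
    x_prior = f(x_post)
    F = f_jac(x_post)
    p_prior = F * p_post * F + w_var
    if o is None:
        # No observation: the a posteriori estimate equals the a priori one.
        return x_prior, p_prior
    # (3) filter gain; the observation Jacobian depends on the action
    H = h_jac(x_prior, action)
    gain = p_prior * H / (H * p_prior * H + r_var)
    # (4) update by incorporating the observation
    x_new = x_prior + gain * (o - h(x_prior, action))
    p_new = (1.0 - gain * H) * p_prior
    return x_new, p_new


# Hypothetical toy model: static state, observation scaled by a zoom-like action.
f = lambda x: x
f_jac = lambda x: 1.0
h = lambda x, a: a * x
h_jac = lambda x, a: a

for zoom in (1.0, 4.0):
    _, p = ekf_step(0.0, 1.0, zoom, 0.0, f, f_jac, h, h_jac, w_var=0.1, r_var=1.0)
    print(zoom, p)  # the larger zoom yields the smaller a posteriori variance
```

In this toy model the a posteriori variance depends on the action only through the observation Jacobian, which is the dependence the action selection exploits.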

3 Multi-step Optimal Actions 

The method described in [5] uses the entropy of the state distribution to select 
the next action for a single step in the future. The single-step approach works 
well if the optimal action can be performed at each time-step. Often, however, 
there will be real-world constraints on which actions are possible; for example, 
cameras with a motorized zoom lens can only change their focal lengths at a 
finite maximal speed. In general, we say that an action, or a sequence of actions, 
incurs a cost. This cost must be subtracted from the expected benefits of the 
actions to find the truly optimal actions. 

In the case of focal length selection, the single-step method will often select 
a large focal length when the object is in the center of the camera image. Once 
the object moves towards the edge, a lower focal length is needed in order not 
to lose the object; this focal length may be too far away for the zoom motors. The
multi-step method, evaluating a sequence of actions, will detect the need for a 
low focal length sooner, and will start reducing the focal length ahead of time. 

To evaluate an action, we use the entropy [2] of the state distribution as a 
measure of uncertainty. This measure was used in [5] to select a single action. 
We will show how this method can be expanded to a sequence of actions. 

To evaluate a sequence of actions, we measure the entropy of the state distribution
at the horizon. The horizon $k$ is the number of steps to be looked
ahead, starting at time step $t$. For the single-step variant, $k = 1$. We denote the
sequences of future actions and observations, occurring between time steps $t + 1$
and $t + k$, as $(a)_k$ and $(o)_k$, respectively.

The entropy of the a posteriori state belief $p(x_{t+k} \mid (o)_{t+k}, (a)_{t+k})$ is

$$H(x_{t+k}^+) = -\int p(x_{t+k} \mid (o)_{t+k}, (a)_{t+k}) \log p(x_{t+k} \mid (o)_{t+k}, (a)_{t+k})\, dx_{t+k}. \qquad (5)$$

This gives us information about the final a posteriori uncertainty, provided actions
$(a)_k$ were taken and observations $(o)_k$ were observed.

However, to determine the optimal actions before the observations are made,
this measure cannot be used directly. Instead, we determine the expected entropy,
given actions $(a)_k$, by averaging over all observation sequences:

$$H(x_{t+k} \mid (o)_k, (a)_k) = \int p((o)_k \mid (a)_k)\, H(x_{t+k}^+)\, d(o)_k. \qquad (6)$$

This value is called the conditional entropy [2]. The notation $H(x_t \mid o_t, a_t)$ is
misleading, but conforms to that used in information theory textbooks. The
only free parameter is the action sequence $(a)_k$. The optimal action sequence
can then be found by minimizing the conditional entropy.

In the case of a Gaussian distribution, as is used throughout the Kalman filter,
the entropy takes the following closed form:

$$H(x_{t+k} \mid (o)_k, (a)_k) = \int p((o)_k \mid (a)_k) \left(\frac{n}{2} + \frac{1}{2}\log\left((2\pi)^n \left|P_{t+k}^+((a)_k)\right|\right)\right) d(o)_k. \qquad (7)$$

We note that only $p((o)_k \mid (a)_k)$ depends on the integration variable $(o)_k$; the covariance
$P_{t+k}^+((a)_k)$ does not. This allows us to place everything else outside the integral,
which then integrates over a probability density function and is therefore
always 1. Therefore, we only need to obtain the a posteriori covariance matrix
$P_{t+k}^+$ to evaluate an action sequence, which means stepping through the Kalman
filter equations $k$ times. Since we do not have any future observations $o$, the state
estimate mean $x$ can only be updated with the expected observation $h(x^-, a)$,
which reduces equation (4) to $x^+ = x^- + 0$. The state estimate mean allows
us to calculate all used Jacobians for equations (2) and (3), which give us all
covariance matrices $P^-$ and $P^+$ for any future time step.
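
Because the entropy depends only on the final covariance, an action sequence can be scored without any observations. The sketch below does this for a scalar state, where the determinant in equation (7) is simply the variance. The toy observation Jacobian `h_jac`, the noise values and the candidate sequences are assumptions made for illustration; in the paper the Jacobians additionally depend on the predicted state mean.

```python
import math


def gaussian_entropy_1d(var):
    """Entropy n/2 + 1/2 log((2 pi)^n |P|) for n = 1 and |P| = var."""
    return 0.5 + 0.5 * math.log(2.0 * math.pi * var)


def final_variance(p_post, actions, h_jac, f_jac=1.0, w_var=0.1, r_var=1.0):
    """Step only the variance part of the filter through a sequence of actions."""
    for a in actions:
        p_prior = f_jac * p_post * f_jac + w_var        # equation (2)
        H = h_jac(a)
        gain = p_prior * H / (H * p_prior * H + r_var)  # equation (3)
        p_post = (1.0 - gain * H) * p_prior             # equation (4)
    return p_post


# Hypothetical observation Jacobian: proportional to a zoom-like action.
h_jac = lambda a: a

candidates = [(1.0, 1.0), (1.0, 4.0), (4.0, 4.0)]
best = min(candidates,
           key=lambda seq: gaussian_entropy_1d(final_variance(1.0, seq, h_jac)))
print(best)
```

Without costs or visibility constraints the highest zoom always wins in this toy setting, which is exactly why the visibility weighting introduced next is needed.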

In cases where an observation is not guaranteed, the final entropy is based on 
either the a posteriori or the a priori covariance matrix. The conditional entropy 
must take this into account. We define an observation to be either visible or 
non-visible. For example, in the case of object tracking, an observation is visible 
if it falls on the image plane of both cameras, and non-visible otherwise. It is 
important to note that a non-visible observation is still an element of the set 
of all observations. For a single step, splitting the observations into visible and 
non-visible ones results in the following entropy: 

$$H(x_t \mid o_t, a_t) = \int_{\{o_t\ \text{visible}\}} p(o_t \mid a_t)\, H_v(x_t^+)\, do_t + \int_{\{o_t\ \neg\text{visible}\}} p(o_t \mid a_t)\, H_{\neg v}(x_t^-)\, do_t \qquad (8)$$
In the Kalman filter case, where $H_v(x_t^+)$ and $H_{\neg v}(x_t^-)$ do not depend on $o_t$,
they can again be moved outside the integration. The remaining integrations
now reflect the probability of a visible ($w_1$) or non-visible ($w_2$) observation:

$$H(x_t \mid o_t, a_t) = w_1 \cdot H_v(x_t^+) + w_2 \cdot H_{\neg v}(x_t^-) \qquad (9)$$

$w_1$ and $w_2$ can be computed efficiently using the Gaussian error function [5].
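
As an illustration of how the error function yields these weights, consider a single image coordinate whose predicted observation is Gaussian. The probability that it falls inside the sensor interval is then a difference of two normal distribution values. This is a one-dimensional sketch with hypothetical names (`visibility_weights`, `lo`, `hi`); the actual computation in [5] covers the full image plane of both cameras.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def visibility_weights(mean, std, lo, hi):
    """Return (w1, w2): probability that the observation lies inside or outside [lo, hi]."""
    w1 = normal_cdf((hi - mean) / std) - normal_cdf((lo - mean) / std)
    return w1, 1.0 - w1


# Predicted observation at the image center, sensor border one standard deviation away.
print(visibility_weights(0.0, 1.0, -1.0, 1.0))
```

A larger focal length magnifies both the predicted position and its uncertainty in the image, which shrinks `w1` as the object approaches the image border.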

In the multi-step case with a horizon of $k$, there are $2^k$ different cases of
visibility, since an observation may be visible or not at each time step, and
hence $2^k$ different possible entropies must be combined. If we can calculate the
probability and the a posteriori entropy at step $t+k$ for each case, we can again
obtain the conditional entropy by a weighted sum:

$$H(x_t \mid o_t, a_t) = w_{vv \ldots v} H_{vv \ldots v}(x_t) + w_{vv \ldots n} H_{vv \ldots n}(x_t) + \cdots + w_{nn \ldots n} H_{nn \ldots n}(x_t) \qquad (10)$$

where $vv \ldots v$ denotes the case where every time step yields a visible observation,
$vv \ldots n$ denotes all visible except for the last, and so on. For such a sequence of
visibilities, the probabilities and covariance matrices can be calculated by using 
the a priori or a posteriori covariance from the previous step as the starting 
point, and proceeding as in the single-step case. 

This can be summarized in a recursive algorithm: For time step $l$, starting
at $l = 1$, the Kalman filter equations use the current action $a_{t+l}$ to produce
the correct state mean ($x_{t+l}^+$) and covariance ($P_{t+l}^+$, $P_{t+l}^-$) predictions for both
cases of visibility, as well as the probabilities $w_1$ and $w_2$ for each case. If $l = k$,
the conditional entropy is calculated as in equation (9), using entropies obtained
from both covariance matrices through equation (7). Otherwise, this procedure
is repeated twice for time $l + 1$: once using $P_{t+l}^+$ as its basis for the visible case,
and once using $P_{t+l}^-$. Both repetitions (eventually) return a conditional entropy
for all steps beyond $l$, and these are combined according to $w_1$ and $w_2$ into the
conditional entropy for time step $l$ to be returned.
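
The recursion can be sketched for a scalar state as follows. This is a simplified illustration under stated assumptions rather than the paper's implementation: the state transition Jacobian is fixed to 1, the callbacks `h_jac` and `p_visible` are hypothetical stand-ins for the action-dependent observation Jacobian and the visibility probability, and the state mean is not tracked.

```python
import math


def entropy_1d(var):
    """Gaussian entropy of equation (7) for a scalar state."""
    return 0.5 + 0.5 * math.log(2.0 * math.pi * var)


def conditional_entropy(p_post, actions, h_jac, p_visible, w_var=0.1, r_var=1.0):
    """Expected entropy after an action sequence, over all 2**k visibility cases."""
    a, rest = actions[0], actions[1:]
    p_prior = p_post + w_var                        # prediction with Jacobian 1
    H = h_jac(a)
    gain = p_prior * H / (H * p_prior * H + r_var)
    p_vis = (1.0 - gain * H) * p_prior              # visible: a posteriori variance
    p_non = p_prior                                 # non-visible: keep the a priori variance
    w1 = p_visible(a, p_prior)
    w2 = 1.0 - w1
    if not rest:                                    # l = k: weighted sum of equation (9)
        return w1 * entropy_1d(p_vis) + w2 * entropy_1d(p_non)
    # l < k: recurse once per visibility case and combine the results
    return (w1 * conditional_entropy(p_vis, rest, h_jac, p_visible, w_var, r_var)
            + w2 * conditional_entropy(p_non, rest, h_jac, p_visible, w_var, r_var))


# Hypothetical trade-off: a higher zoom is more informative but less likely visible.
h_jac = lambda a: a
p_visible = lambda a, p_prior: 1.0 / (1.0 + 0.2 * a * a)

for seq in [(1.0, 1.0), (4.0, 1.0), (4.0, 4.0)]:
    print(seq, conditional_entropy(1.0, seq, h_jac, p_visible))
```

The call tree has $2^k$ leaves per action sequence, which is consistent with the steep growth in running time for larger horizons reported in the experiments.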

4 Experiments 

This algorithm was evaluated in a simulated object tracking system. Current 
computational restrictions make a meaningful evaluation in a real-world envi- 
ronment impossible, since the insufficient speed of the zoom motors, a key aspect 
of the problem, is no longer present. 

The following simulated setup, as shown in figure 1, was used: The target
object follows a circular pathway. The sensors are two cameras with parallel lines
of sight and a variable focal length. The cameras are 200 units apart. The center
of the object’s path lies midway between the two cameras, at a distance of 1500
units; its radius is 200 units.

Fig. 1. Simulation setup. The object moves on a circular path. At points $\alpha$ and $\beta$,
object loss may occur due to limited zoom motor speed.

Simulations were performed with horizons of 1, 2, 3 and 4, and with zoom
motor speeds of 3, 4 and 5 motor steps per time step, for a total of 12 different
experiments. Each experiment tracked the object for 10 full rotations in 720
time steps. For comparison, one experiment was also conducted with fixed focal
lengths, and one with unlimited motor speed. In our implementation, a Pentium
processor at 2.8 GHz takes less than two minutes for a horizon length of 1
(including output). An experiment with horizon length 4 takes about 6 hours.
This implementation iterates over the entire action space, without concern for
efficiency. Section 5 lists several enhancements with the potential for drastic
speed increases to possibly real-time levels.

Figure 2 (left) shows the number of time steps with visible observations, out 
of a total of 720, for each experiment. The lower the value, the longer the object 
was lost. The object was typically lost near points $\alpha$ or $\beta$ in figure 1, at which
the object approaches the border of a camera’s image plane faster than the zoom 
motor can reduce the focal length. 

Figure 2 (right) shows the actual focal lengths selected by the lower camera 
in figure 1. Two cycles from the middle of the experiments are shown. The 
experiments being compared both use a motor zoom speed of 3, and a horizon 
length of 1 and 4. Additionally, the focal lengths which occur when the zoom 
motor speed is unlimited are shown. One can see that a larger horizon produces 
similar focal lengths to a single-step system, but it can react sooner. This is 
visible between time steps 190 and 210, where the four-step lookahead system 
starts reducing the focal length ahead of the single-step variant. This results 
in reduced object loss. The plateaus at time steps 170 and 240 result from the 
object being lost in the other camera, increasing the state uncertainty. 

Table 1, lastly, shows the mean state estimation error, as compared to the 
ground truth state. The advantage of a multi-step system is greatest in the case 
of a slow zoom motor (top row), where the increased probability of a valid ob- 
servation more than makes up for the slight increase in information which the 
single-step system obtains with its larger focal lengths. This advantage dimin- 
ishes once the zoom motors are fast enough to keep up with the object. The 
second-to-last row shows the mean error for a horizon of 1 and an unlimited mo- 
tor speed. This is the smallest error achievable by using variable focal lengths. 
The last row contains the mean error for the largest fixed focal length which 
suffered no object loss. An active zoom can reduce this error by up to 45%, but 
only if the zoom motor is fast enough to avoid most object loss. 
Table 1. Mean error, in world units, for each of the 12 experiments. The last two
rows show the results for an unlimited zoom motor speed, and a fixed focal length. A
variable focal length approach is always superior to a fixed one, except for the special
case of slow zoom motors. These cases can be caught by a multi-step lookahead.

Zoom motor speed   horizon 1   horizon 2   horizon 3   horizon 4
3 steps            52.5        33.7        30.3        23.3
4 steps            21.2        20.4        17.1        16.1
5 steps            16.9        16.9        15.9        16.1
unlimited          15.1
fixed              27.7










Fig. 2. Left: Number of time steps with observations, from a total of 720, for each
experiment. Lower values mean greater object loss. Right: Focal lengths for two object
cycles at a zoom motor speed of 3 and horizons of 1 and 4, plotted over time steps. The
focal lengths from an unlimited motor speed are also shown.



5 Conclusion and Outlook 



The methods presented in this paper implement a new and fundamental approach
to selecting information theoretically optimal sensor actions, with respect to
a varying cost model, by predicting the benefit of a given sequence of actions
several steps into the future. For the example of focal length selection, we have
shown that, given a small action range, this multi-step approach can alleviate the
problems that the single-step method faces. In our experiments, we were able to
reduce the fraction of time steps with no usable observation by over 80%, which
in turn reduced the mean state estimation error by up to 56%.

Future work will focus on reducing the computation time, to enable meaning- 
ful real-time experiments, and finally real-time applications, of multi-step action 
selection. For example, the results from common subexpressions, i.e. the first 
calculations for two action sequences with a common start, can be cached. 

Another optimization is to test only a subset of all possible action sequences,
with optimization methods which only rely on point evaluation. Application
dependent analysis of the topology of the optimization criterion, such as axis
independence and local minimality, may allow more specialized optimization
methods. The efficiency may also be improved by intelligently pruning the evaluation
tree, for example using methods from artificial intelligence research, such
as alpha-beta pruning [9], or research in multi-hypothesis Kalman filters [1].

Though this paper only outlined the procedure for use with a Kalman filter, 
the method should be general enough to apply to other estimation systems, for 
example particle filters [7]. This is non-trivial, since this work makes use of the 
fact that the entropies do not depend on the actual value of the observations. 
This is no longer the case with more general state estimators. 

Multiple camera actions have also been studied in object recognition [3] using 
reinforcement learning. The parallels between the reinforcement learning meth- 
ods and this work will be investigated. 

Lastly, these methods need to be evaluated for more general cost models, 
based on the “size” or “distance” of an action and not just on its feasibility. 
Abstract. Reliable object detection and segmentation is crucial for active
safety driver assistance applications. In urban areas where the object
density is high, a segmentation based on a spatial criterion often fails due 
to small object distances. Therefore, optical flow estimates are combined 
with distance measurements of a Laserscanner in order to separate ob- 
jects with different motions even if their distance is vanishing. Results 
are presented on real measurements taken in potentially harmful traffic 
scenarios. 



1 Introduction 

The ARGOS project at the University of Ulm aims at a consistent dynamic
description of the vehicle’s environment for future advanced safety applications
such as automatic emergency braking, PreCrash and pedestrian safety. A Laserscanner
and a video camera mounted on the test vehicle retrieve the necessary
measurements of the vehicle’s environment [1].

The Laserscanner acquires a distance profile of the vehicle’s environment.
Each measurement represents an object detection in 3D space. Because of the
high reliability of object detection and the accurate distance measurements at
a high angular resolution, the Laserscanner is well suited for object detection,
tracking and classification [2]. However, there are scenarios, especially in dense
urban traffic, where the algorithms fail. The Laserscanner tracking and classification
algorithms are based on a segmentation of the measurements. The
measurements are clustered with respect to their distance. Objects which are
close together are therefore wrongly recognised as a single segment. Thus, object
tracking and classification are bound to be incorrect.

A similar problem arises in stereo vision. In [3] stereo vision is combined with 
optical flow estimates in order to detect moving objects even if they are close to 
other stationary objects. However, the approach can not differentiate between 
two dissimilarly moving objects. Dang et al. developed an elegant Kalman Filter 
implementation for object tracking using stereo vision and optical flow [4]. This 
algorithm uses a feature tracking approach and can be used for image segmen- 
tation based on the object dynamics. 

Our approach aims at a correct Laserscanner based segmentation of objects
even if they are close together, by analysing their motion pattern in the video image
domain. The segmentation criterion is therefore based on the distance between
Laserscanner measurements and additionally the difference of the associated
optical flow estimates in the video images.

2 Sensors 

A Laserscanner and a monocular camera are combined in order to enable a reliable
environment recognition at distances of up to 80 m. The multi-layer Laserscanner
ALASCA (Automotive LAserSCAnner) of the company IBEO Automobile
Sensor GmbH (Fig. 1) acquires distance profiles of the vehicle’s environment
of up to 270° horizontal field of view at a variable scan frequency of 10 to 40 Hz.
At 10 Hz the angular resolution is 0.25° with a single shot measurement standard
deviation of $\sigma$ = 3 cm, thus enabling a precise distance profile of the vehicle’s
environment. It uses four scan planes in order to compensate for the pitch angle
of the ego vehicle. The Laserscanner ALASCA has been optimised for automotive
application and performs robustly even in adverse weather conditions.
The multi-layer Laserscanner is mounted at the front bumper of the test vehicle,
which reduces the horizontal field of view to 180°.




Fig. 1. The multi-layer Laserscanner ALASCA (Automotive LAserSCAnner) of the 
company IBEO Automobile Sensor GmbH. 



The monocular camera is mounted behind the windscreen beside the inner 
rear mirror. The camera is equipped with a 1/2" CCD chip which has a standard 
VGA resolution of 640 x 480 pixels. With an 8 mm lens a horizontal field of view of 44° is 
realised at an average angular resolution of 0.07° per pixel. 

In order to synchronise the sensors, the camera is triggered when the rotating 
Laserscanner head is aligned with the direction of the optical axis of the camera. 
The sensors are calibrated in order to enable not only a temporal alignment given 
by the synchronisation but also a spatial alignment. By means of an accurate 
synchronisation and calibration, image regions can be associated directly with 
Laserscanner measurements. Therefore, a distance can be assigned to individual 
image regions, which is a major advantage of this fusion approach. 

3 Laserscanner Based Segmentation 

In order to reduce the amount of data which has to be processed, the 
Laserscanner measurements are combined into segments. The aim of the segmentation 
is to generate clusters which each represent an object in reality. Optimally, there 
is exactly one segment per object and only one object per segment. This is, 
however, not always possible to realise. 

The segments are created based on a distance criterion. Measurements with a 
small distance to neighbouring measurements are included in the same segment. 
Both the x and y components of the distance, $d_x$ and $d_y$, have to be below a 
certain threshold $\theta_0$. For urban scenarios a sensible choice is $\theta_0 = 0.7$ m. Especially 
in urban areas, where the object density is high, two objects might be so close 
together that all measurements on these objects are combined into a single segment. 
This is critical for object tracking and classification algorithms which are based 
on the segmentation. If the measurements of two objects are combined into one 
single segment, the object tracking cannot estimate the true velocity of the two 
objects, which is especially severe if the objects exhibit different velocities (Fig. 2). 
Additionally, a classification of the object type (car, truck, pedestrian, small 
stationary object and large stationary object) based on the segment dimensions 
is bound to be incorrect. 

However, reducing the threshold $\theta_0$ results in an increase of objects which are 
represented by several segments. This object disintegration is difficult to handle 
using only Laserscanner measurements. To the authors' knowledge, no real-time 
Laserscanner object tracking algorithm has yet been proposed which 
is robust against a strong object disintegration in urban scenarios. 
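
The distance criterion of this section can be illustrated with a minimal sketch. It assumes that "neighbouring" means consecutive in scan order and that each measurement is an (x, y) position in metres; the function name, the data layout and the sample scan are hypothetical and not taken from the original implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position of one measurement in metres


def segment_scan(points: List[Point], theta_0: float = 0.7) -> List[List[Point]]:
    """Group scan points, given in scan order, into segments.

    A point joins the current segment if both |d_x| and |d_y| to the
    previous point are below theta_0; otherwise it opens a new segment.
    """
    segments: List[List[Point]] = []
    for p in points:
        if segments:
            q = segments[-1][-1]  # last point of the current segment
            if abs(p[0] - q[0]) < theta_0 and abs(p[1] - q[1]) < theta_0:
                segments[-1].append(p)
                continue
        segments.append([p])
    return segments


# Hypothetical scan: two cars parked 0.6 m apart and one distant object.
scan = [(10.0, -2.0), (10.0, -1.6), (10.1, -1.2),  # car A
        (10.2, -0.6), (10.2, -0.2),                # car B
        (25.0, 3.0), (25.1, 3.4)]                  # distant object

merged = segment_scan(scan)              # threshold 0.7 m: cars A and B merge
split = segment_scan(scan, theta_0=0.5)  # smaller threshold separates them
```

In this toy scan the 0.7 m threshold joins both cars into one segment, while 0.5 m separates them. On real data a smaller threshold would also break single objects into several segments, which is the disintegration problem described above.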




Fig. 2. Laserscanner based segmentation of a parking scenario at two time instances. 

4 Spatio-temporal Segmentation Using Optical Flow 

In order to improve the Laserscanner based segmentation which uses a distance 
criterion, an additional criterion is introduced. Considering two consecutive im- 
ages the optical flow can be calculated for image regions which are associated 
with Laserscanner measurements. Using the optical flow as an additional segmen- 
tation criterion enables the differentiation between objects of diverging lateral 
motions even if they are close together. 

The optical flow $\mathbf{f} = (f_u, f_v)$ is calculated with a gradient based method [5,6,7]. 
In automotive applications, the ego motion component of the optical flow can 
be high even when using short measurement intervals. Therefore, a pyramidal 
optical flow estimation is applied in order to account for large displacements [8]. 

Two spatio-temporal segmentation algorithms have been developed: the 
constructive and the destructive segmentation. 



4.1 Constructive Segmentation 



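The constructive approach changes the segmentation distance threshold $\theta_0$ 
depending on the similarity of the assigned optical flow. Extending the optical flow 
vector, without loss of generality, with the time dimension, 

$$\tilde{\mathbf{f}} = \frac{1}{\sqrt{f_u^2 + f_v^2 + 1}} \begin{pmatrix} f_u \\ f_v \\ 1 \end{pmatrix}, \qquad (1)$$

the similarity of two optical flow vectors $\mathbf{f}_1$ and $\mathbf{f}_2$ is given by the angle $\psi$ between 
the extended vectors [5] 

$$\psi = \left| \arccos\left( \tilde{\mathbf{f}}_1 \cdot \tilde{\mathbf{f}}_2 \right) \right|, \quad \text{with } \psi \in [0, \pi]. \qquad (2)$$

This similarity measure $\psi$ is, however, biased towards large optical flow vectors 
$\mathbf{f}$. Therefore the optical flow vectors are normalised, with 

$$\mathbf{f}^* = \frac{2}{\|\mathbf{f}_1\| + \|\mathbf{f}_2\|} \, \mathbf{f}, \qquad (3)$$

before applying equations (1) and (2). 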
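
As an illustration, the similarity angle can be computed as in the following sketch. It assumes 2D flow vectors given as (f_u, f_v) tuples; the function name and the handling of two zero vectors (returning an angle of 0) are assumptions made here, not part of the paper.

```python
import math


def flow_angle(f1, f2):
    """Angle psi between two flow vectors, following equations (1) to (3)."""
    n1, n2 = math.hypot(f1[0], f1[1]), math.hypot(f2[0], f2[1])
    if n1 + n2 == 0.0:
        return 0.0  # assumption: two zero flows are treated as identical
    scale = 2.0 / (n1 + n2)  # normalisation of eq. (3)

    def extend(f):
        # eq. (1): append the time dimension and normalise to unit length
        u, v = f[0] * scale, f[1] * scale
        norm = math.sqrt(u * u + v * v + 1.0)
        return (u / norm, v / norm, 1.0 / norm)

    a, b = extend(f1), extend(f2)
    dot = sum(p * q for p, q in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return abs(math.acos(dot))      # eq. (2)


same = flow_angle((3.0, 0.0), (3.0, 0.0))          # identical flows
opposite = flow_angle((1.0, 0.0), (-1.0, 0.0))     # opposite flows
opposite_big = flow_angle((10.0, 0.0), (-10.0, 0.0))
```

Because of the normalisation, `opposite` and `opposite_big` give the same angle, which is the bias correction that equation (3) is meant to provide.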

The segmentation process is performed as in the Laserscanner based approach. 
Two Laserscanner measurements are assigned to the same segment if 
their distance components $d_x$ and $d_y$ are below the threshold $\theta_0$. However, the 
threshold is now a function of the similarity measure $\psi$, with 

$$\theta(\psi) = \theta_0 (a \psi + b), \qquad (4)$$

where a and b are parameters of a linear transformation of $\psi$. The parameters 
a and b are chosen so that $\theta(\psi)$ is increased for similar optical flow vectors 
and decreased for dissimilar vectors. If there is no optical flow assigned to the 
Laserscanner measurement, a threshold of $\theta(\psi) = \theta_0$ is chosen. 
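
A small sketch of the flow dependent threshold follows. The paper does not state values for a and b; the values used here (a = -1/π, b = 1.5) are purely illustrative assumptions, chosen so that the threshold grows by 50 % for identical flow and shrinks by 50 % for the largest possible angle.

```python
import math

THETA_0 = 0.7          # base threshold in metres (urban value from section 3)
A = -1.0 / math.pi     # hypothetical slope, not taken from the paper
B = 1.5                # hypothetical offset, not taken from the paper


def adaptive_threshold(psi, theta_0=THETA_0, a=A, b=B):
    """Distance threshold of eq. (4); psi is None if no flow is available."""
    if psi is None:
        return theta_0
    return theta_0 * (a * psi + b)


similar = adaptive_threshold(0.0)         # identical flow: larger threshold
dissimilar = adaptive_threshold(math.pi)  # maximal angle: smaller threshold
no_flow = adaptive_threshold(None)        # fallback to theta_0
```

With these assumed parameters the threshold equals the base value at an angle of π/2; any other choice that is decreasing in the angle would fit the description equally well.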

This segmentation approach performs well if the optical flow vectors can 
be determined precisely even at the object boundaries where occlusions occur. 
As this could not be achieved with the chosen optical flow approach, a second 
segmentation was developed which is more robust against inaccurate optical flow 
estimates. 

4.2 Destructive Segmentation 

The destructive approach is based on the segmentation of Laserscanner 
measurements described in section 3. The threshold $\theta_0$ is chosen so that the object 
disintegration is low. Therefore, measurements on objects which are close together 
are often assigned to the same segment. In this approach the video images are 
used to perform a segmentation based on optical flow estimates. The Laserscanner 
and video based segmentations are performed individually. If the optical flow 
segmentation indicates the existence of several objects within the image region 
of an associated Laserscanner segment, the Laserscanner segment is separated 
according to the optical flow segments. 

Fig. 3. Optical flow profile assigned to a Laserscanner segment. (a) shows Laserscanner 
measurements which are associated with the same Laserscanner segment, (b) the 
respective image region, (c) the horizontal optical flow component $f_u$ for the four scan layers 
as a function of the viewing angle $\alpha$. The dotted horizontal lines indicate the $\alpha$-axis 
for the individual scan layers. 

Fig. 4. Approximation of the optical flow profile by a set of linear functions. The 
detected object boundary is indicated with the vertical dashed lines. 

Fig. 3 shows a Laserscanner segment and the associated image region of a 
parking situation. The distant car backs out of a parking space. The optical flow 
estimation is attention driven and only calculated at image regions which are 
assigned to a Laserscanner measurement. The horizontal optical flow component 
$f_u$ for the four scan layers is shown in Fig. 3 (c). This optical flow profile is used 
for the optical flow based segmentation. 

The raw optical flow profile is corrupted by outliers caused by reflections or 
other effects which violate the assumptions of the brightness change constraint 
equation [7]. Therefore a median filter is applied to the optical flow estimates in 
order to reduce the number of outliers. 
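
The outlier suppression can be sketched as a sliding median over the flow profile of one scan layer. The window size of 3 and the shrinking window at the profile borders are assumptions; the paper does not specify them.

```python
from statistics import median


def median_filter(values, window=3):
    """Sliding median; the window shrinks at the borders of the profile."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


# Hypothetical flow profile with one outlier (9.0), e.g. from a reflection.
profile = [1.0, 1.0, 9.0, 1.0, 1.0]
filtered = median_filter(profile)
```

A median is used rather than a mean because a single outlier does not leak into its neighbours, and a genuine step in the profile (an object boundary) stays sharp.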

The object boundaries appear in the optical flow profile as discontinuities. In 
order to detect these discontinuities, the profile is approximated by a set of linear 
functions (Fig. 4). Initially, the optical flow profile is represented by a single line 
segment $L_i$. Recursively, a line segment is split into two if the maximal distance 
$d(\alpha, L_i)$ of the optical flow profile to the line segment exceeds a threshold $r_e$, 

$$d(\alpha, L_i) > r_e(\|\mathbf{f}\|). \qquad (5)$$

The threshold $r_e$ is motivated by the noise in the optical flow estimates, which is 
a function of the magnitude of the optical flow vector $\mathbf{f}$, and by the expected errors 
caused by violations of the brightness change constraint equation. The gradients 
$m(L_i, n)$ of the line segments $L_i$ of the individual scan layers n are combined into 
an averaged estimate $\bar{m}(L_i)$, after deletion of potential outliers, 

$$\bar{m}(L_i) = \frac{1}{N} \sum_{n=1}^{N} m(L_i, n), \qquad (6)$$

where N is the number of scan layers. Object boundaries are classified based on 
the averaged gradient of the line segments $\bar{m}(L_i)$, with 

$$\bar{m}(L_i) > m_{max}, \qquad (7)$$

where $m_{max}$ is the maximal allowable steepness for a line segment of a single 
rigid object. 

The destructive segmentation assumes that objects are rigid and that object 
boundaries are mainly vertical in the image domain. In the parking scenarios 
chosen for evaluation purposes these assumptions are perfectly met. 
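
The recursive approximation of the flow profile by line segments can be sketched for a single scan layer as follows. Several details are assumptions made for this illustration and are not stated in the paper: each line segment is taken through the first and last sample of its interval, the distance is measured vertically, the split threshold is a constant `tol` rather than a function of the flow magnitude, and the absolute gradient is compared with `m_max`. The averaging over scan layers is omitted.

```python
def split_profile(alpha, flow, tol):
    """Return (start, end) index pairs of the approximating line segments."""
    def recurse(lo, hi):
        if hi - lo < 2:
            return [(lo, hi)]  # no interior sample left to test
        slope = (flow[hi] - flow[lo]) / (alpha[hi] - alpha[lo])
        # vertical distance of every interior sample to the end-point line
        dist = [abs(flow[k] - (flow[lo] + slope * (alpha[k] - alpha[lo])))
                for k in range(lo + 1, hi)]
        k_max = max(range(len(dist)), key=dist.__getitem__)
        if dist[k_max] <= tol:
            return [(lo, hi)]
        mid = lo + 1 + k_max  # split at the sample with the largest distance
        return recurse(lo, mid) + recurse(mid, hi)

    return recurse(0, len(alpha) - 1)


def object_boundaries(alpha, flow, tol, m_max):
    """Angular intervals whose line segment is steeper than m_max."""
    found = []
    for lo, hi in split_profile(alpha, flow, tol):
        slope = (flow[hi] - flow[lo]) / (alpha[hi] - alpha[lo])
        if abs(slope) > m_max:
            found.append((alpha[lo], alpha[hi]))
    return found


# Hypothetical profile: flow jumps from 1 to 5 between viewing angles 3 and 4.
alpha = [float(k) for k in range(8)]
flow = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
pieces = split_profile(alpha, flow, tol=0.5)
boundaries = object_boundaries(alpha, flow, tol=0.5, m_max=1.0)
```

The step in the toy profile ends up as its own short, steep segment, which is then reported as an object boundary, while the two flat segments on either side are kept as single rigid objects.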



5 Results 

The presented segmentation algorithms were evaluated on parking scenarios. In 
all scenarios a car backs out of a parking lot. The speed of the ego-vehicle varies 
across the scenarios, which introduces an additional optical flow component with 
increasing magnitude towards the image borders. Twelve scenarios were 
investigated with both segmentation approaches and compared to the Laserscanner 
segmentation. The focus was on the car which backs out and its neighbouring 
car. The time stamp and the position of the moving car were recorded when 
the two objects were first continuously separated by the Laserscanner approach. 
Then, the time stamps and positions for the two other approaches were noted. 
The averages of the differences in time and position of the optical flow 
based approaches with respect to the Laserscanner segmentation are summarised 
in Table 1. 



Table 1. Gained time and the respective covered distance of the car backing out of 
the parking lot. 

        Constructive                  Destructive 
  Time [s]    Distance [m]      Time [s]    Distance [m] 
    2.2           1.5             2.3           1.6 



On average, the optical flow based segmentations detect the moving car as an 
individual object 2.2 s (constructive) and 2.3 s (destructive) earlier than the 
Laserscanner segmentation. This gained time corresponds to a covered distance of 
the car backing out of the parking lot of 1.5 m and 1.6 m, respectively. 

The two spatio-temporal segmentations perform similarly in terms of an early 
separation of the two objects of different lateral speeds. However, the more 
general constructive approach exhibits a higher degree of object disintegration. 
The two objects are often represented by more than two segments. This is due 
to inaccuracies in the optical flow estimation, especially at object borders. 

The destructive approach is less general, as it takes only the horizontal optical 
flow component into account. This is, however, the main motion component of 
cars moving laterally to the sensor's viewing direction and therefore sufficient to 
consider with respect to the application. The filtering and region based linear 
approximation of optical flow estimates enable the algorithm to be more robust 
against inaccuracies in the optical flow estimation. The result is a very low degree 
of object disintegration. 







Further examination of the results showed that the performance depends on 
two main factors, independent of the chosen algorithm. First, the performance 
decreases with increasing velocity of the ego-vehicle, as the optical flow artefacts 
and the noise increase and therefore the SNR decreases. Second, the performance 
depends on the velocity difference tangential to the viewing direction between 
the close objects. The higher the velocity difference, the better the performance. 
In the scenario of a car backing out of a parking lot, the performance depends 
directly on its velocity, which varies in the experiments between 1 and 9 km/h. 

6 Conclusion 

Two spatio-temporal segmentation approaches have been presented. Based on 
Laserscanner measurements and optical flow estimates of associated image re- 
gions a robust segmentation of objects is enabled even if objects are close to- 
gether. In potentially harmful situations the correct segmentation allows a pre- 
cise tracking of moving objects. The accurate segmentation and therefore track- 
ing is an essential prerequisite for a reliable prediction of objects in dynamic 
scenarios for active safety systems in future cars such as automatic emergency 
braking. 

Abstract. This work addresses the two major drawbacks of current 
statistical uncertain geometric reasoning approaches. In the first part a 
framework is presented that allows uncertain line segments in 2D- and 
3D-space to be represented and statistical tests to be performed with these 
practically very important types of entities. The second part addresses the issue of 
performance of geometric reasoning. A data structure is introduced that 
allows the efficient processing of large numbers of statistical tests involving 
geometric entities. The running times of this approach are finally 
evaluated experimentally. 



1 Introduction 

In [5] the uncertain geometric entities point, line and plane in 2D- and 3D-space, 
represented using Grassmann-Cayley algebra, were used to perform statistical 
tests such as incidence, equality, parallelism or orthogonality between a pair of 
entities. This is a very useful tool in many computer vision and perceptual 
grouping tasks, as both often deal with measurements of geometric entities and 
rely on the relational properties of the measured entities between each other (cf. 
[11], [8], [9]). 

However, there are two major drawbacks in this approach: first, the 
Grassmann-Cayley algebra does not allow localized objects, such 
as line segments in 2D- and 3D-space, to be represented in a straightforward manner, 
and second, there are no considerations about performing a huge number of 
relational tests in an efficient manner. 

Both of these drawbacks are addressed in this work. The first issue is 
addressed by using compound entities, i.e. by constructing new geometric entities 
from the existing base entity classes, on the one hand and by moving from the 
projective framework of [5] and [7] to an oriented projective framework (cf. [13]) on 
the other hand. The second issue is addressed by proposing a data structure for 
storing the entities and gaining efficiency in testing geometric relations over a 
large amount of data. The proposed data structure resolves the shortcomings 
of the classical multi-dimensional data structures R-Tree, R*-Tree and Quadtree 
(cf. [6], [12], [4], [10]), which are unable to store uncertain line segments for 
efficient use in statistical testing tasks. 

The speed gained by using the proposed index structure for geometric reasoning, 
instead of simply computing all relational properties pairwise in a sequential 
manner, will be evaluated experimentally. 

2 Compound Geometric Entities and Their Relations 



2.1 Base Entities in Oriented Projective Space 



The line segments will be constructed from uncertain geometric base entities in 
oriented projective space. For these base entities first consider the 2D-case: a point 
and a line may be represented by homogeneous 3-vectors $x$ and $l$. In oriented 
projective space the sign of the scalar product $l^T x$ can be used to indicate whether the 
point lies on the right hand side or on the left hand side of the line. This can be 
used to define the notion of direction for lines and orientation for points. Note 
that points with negative orientation do not correspond to Euclidean points. If 
one represents the uncertainty of the entities with their covariance matrices $\Sigma_{ll}$ 
and $\Sigma_{xx}$ and chooses a threshold $T_\alpha$ according to the $\chi^2$-distribution as proposed 
in [5], the statistical incidence test can be extended in the following way: 

if 

$$\frac{l^T x}{\sqrt{l^T \Sigma_{xx} l + x^T \Sigma_{ll} x}} < \sqrt{T_\alpha} \qquad (1)$$

holds, there is no reason to reject the hypothesis that the point lies on the 
left hand side of the line. This will be denoted by $x \in^{-} l$. 

if 

$$-\sqrt{T_\alpha} < \frac{l^T x}{\sqrt{l^T \Sigma_{xx} l + x^T \Sigma_{ll} x}} \qquad (2)$$

holds, there is no reason to reject the hypothesis that the point lies on the right 
hand side of the line. This will be denoted by $x \in^{+} l$. 

Notice that the two cases are not mutually exclusive, but the combination of 
both conditions yields the classical incidence relation that is proposed in [5]. 
This will be denoted by $x \in l$. 
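
The extended incidence test can be sketched for the 2D case as follows. This is an illustration under stated assumptions, not the authors' implementation: the threshold 3.84 is the 95 % quantile of the chi-square distribution with one degree of freedom (an assumed significance level), covariance matrices are plain nested lists, and the function and key names are hypothetical. The variance in the denominator must be positive.

```python
import math


def quad_form(v, m):
    """v^T M v for a 3-vector v and a 3x3 matrix m given as nested lists."""
    return sum(v[i] * m[i][j] * v[j] for i in range(3) for j in range(3))


def oriented_incidence(l, x, cov_l, cov_x, t_alpha=3.84):
    """Evaluate the two one-sided tests and their combination.

    'minus' corresponds to condition (1), 'plus' to condition (2), and
    'incident' to the classical two-sided incidence test.
    """
    variance = quad_form(l, cov_x) + quad_form(x, cov_l)
    t = sum(a * b for a, b in zip(l, x)) / math.sqrt(variance)
    root = math.sqrt(t_alpha)
    minus = t < root
    plus = t > -root
    return {"minus": minus, "plus": plus, "incident": minus and plus}


ZERO = [[0.0] * 3 for _ in range(3)]                 # line assumed exact here
COV_X = [[1e-4, 0.0, 0.0], [0.0, 1e-4, 0.0], [0.0, 0.0, 0.0]]
line = (0.0, 1.0, 0.0)                               # the line y = 0

close = oriented_incidence(line, (2.0, 0.01, 1.0), ZERO, COV_X)
above = oriented_incidence(line, (2.0, 0.5, 1.0), ZERO, COV_X)
below = oriented_incidence(line, (2.0, -0.5, 1.0), ZERO, COV_X)
```

A point one standard deviation away from the line passes both one-sided tests and is therefore incident, whereas a point fifty standard deviations away passes exactly one of them, which identifies its side.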

In 3D-space the situation for points and planes is just the same, since every 
test consisting of a scalar product can be extended this way. In addition to 
the incidence relation, the notations for the relations parallelism ($\parallel$, $\parallel^{-}$ and $\parallel^{+}$) 
and orthogonality ($\perp$, $\perp^{-}$ and $\perp^{+}$) are introduced as well in the case of scalar 
valued test statistics. If the test statistic instead is vector valued and bilinear, 
the situation is a little more involved. Let us first consider the case of a point 
$X$ and a line $L$ in 3D-space. According to [5], there is no reason to reject the 
hypothesis $X \in L$ if 

$$d^T \Sigma_{dd}^{+} d < T_\alpha \qquad (3)$$

with 

$$d = \overline{\Gamma}^T(L) X \quad \text{and} \quad \Sigma_{dd} = \overline{\Pi}^T(X) \Sigma_{LL} \overline{\Pi}(X) + \overline{\Gamma}^T(L) \Sigma_{XX} \overline{\Gamma}(L)$$

and $T_\alpha$ chosen according to the $\chi^2$-distribution (see [5] for the definition of 
the matrices $\overline{\Pi}$ and $\overline{\Gamma}$). Since $d$ is a vector, the notion of a single sign is not 
applicable here. However, a test can be formulated for whether two points $X$ and $Y$ 
lie on opposite sides of $L$, by requiring that $X$ and $Y$ lie on opposite sides of each 
of the four planes defined by the rows of $\overline{\Gamma}^T(L)$, that is, if $\overline{\Gamma}^T(L) X = -\overline{\Gamma}^T(L) Y$. 
Thus one obtains the following statistical test: if for all i = 1..4 the condition 

$$\left( \frac{d_i^X}{\sigma_{d_i^X}} < T_\alpha \wedge \frac{d_i^Y}{\sigma_{d_i^Y}} > -T_\alpha \right) \vee \left( \frac{d_i^X}{\sigma_{d_i^X}} > -T_\alpha \wedge \frac{d_i^Y}{\sigma_{d_i^Y}} < T_\alpha \right)$$

with 

$$d_i^X = \overline{\Gamma}_i^T(L) X, \qquad \sigma_{d_i^X}^2 = \overline{\Pi}_i^T(X) \Sigma_{LL} \overline{\Pi}_i(X) + \overline{\Gamma}_i^T(L) \Sigma_{XX} \overline{\Gamma}_i(L)$$

$$d_i^Y = \overline{\Gamma}_i^T(L) Y, \qquad \sigma_{d_i^Y}^2 = \overline{\Pi}_i^T(Y) \Sigma_{LL} \overline{\Pi}_i(Y) + \overline{\Gamma}_i^T(L) \Sigma_{YY} \overline{\Gamma}_i(L)$$

and $T_\alpha$ chosen according to the $\chi^2$-distribution holds, then there is no reason 
to reject the hypothesis that $X$ and $Y$ lie on opposite sides of $L$. This will 
be denoted by $(X, Y) \in^{\otimes} L$. Every bilinear test statistic can be used this way, 
although the interpretations of the test are not as clear as in the case of point-line 
incidence. 



2.2 Representing Line Segments and Their Tests 

First consider the 2D-case again: A line segment can be represented by its two 
end-points $x$ and $y$, the line $l$ connecting those two end-points and the two lines 
$m$ and $n$, orthogonally intersecting $l$ in $x$ and $y$ and directed such that their 
normals point away from the line segment. More details about the construction 
of such line segments can be found in [2]. 

Again the construction generalizes to 3D line segments in a straightforward 
manner, by using the end-points X and Y, the connecting line L and the planes 
A and B orthogonally intersecting L in the points X and Y, directed, such that 
their normals point away from the line segment. 

It is now possible to perform a sequence of statistical tests on the base 
elements to obtain a result for the compound entity. For example, the incidence of a 
2D point $z$ with the 2D line segment $(x, y, l, m, n)$ can be defined as either $z$ 
being incident to one of the end-points $x$ or $y$, or $z$ being incident to the connecting 
line $l$ and lying between the two directed lines $m$ and $n$. In the previous notation, 
with logical and denoted by $\wedge$ and logical or denoted by $\vee$, this reads 
$z \equiv x \vee z \equiv y \vee (z \in l \wedge z \in^{-} m \wedge z \in^{-} n)$. Other statistical tests including 
incidence, equality, orthogonality and parallelism with 2D line segments are derived 
easily in a similar manner (cf. [2] for details). In the case of 3D line segments some 
useful relations are summarized in Table 1. It can be seen that many useful 
statistical tests can be performed very easily with the proposed representation 
for line segments. 

Table 1. Relations with the 3D line segment $(X_1, Y_1, L_1, A_1, B_1)$ 

Entity          Relation     Tests 

point Z         incident     $(Z \equiv X_1) \vee (Z \equiv Y_1) \vee ((Z \in L_1) \wedge (Z \in^{-} A_1) \wedge (Z \in^{-} B_1))$ 

line M          intersect    $L_1 \in M \wedge (X_1, Y_1) \in^{\otimes} M$ 
                orthogonal   $L_1 \in M \wedge (X_1, Y_1) \in^{\otimes} M \wedge L_1 \perp M$ 
                parallel     $L_1 \parallel M$ 
                incident     $L_1 \equiv M$ 

plane C         intersect    $(X_1, Y_1) \in^{\otimes} C$ 
                incident     $L_1 \in C$ 
                orthogonal   $(X_1, Y_1) \in^{\otimes} C \wedge L_1 \perp C$ 
                parallel     $L_1 \parallel C$ 

line segment    intersect    $L_1 \in L_2 \wedge (X_1, Y_1) \in^{\otimes} L_2 \wedge (X_2, Y_2) \in^{\otimes} L_1$ 
$(X_2, Y_2, L_2, A_2, B_2)$ 
                orthogonal   $L_1 \in L_2 \wedge (X_1, Y_1) \in^{\otimes} L_2 \wedge (X_2, Y_2) \in^{\otimes} L_1 \wedge L_1 \perp L_2$ 
                parallel     $((X_1 \in^{-} A_2 \wedge Y_1 \in^{-} B_2) \vee (X_1 \in^{-} B_2 \wedge Y_1 \in^{-} A_2)) \wedge L_1 \parallel L_2$ 
                incident     $((X_1 \in^{-} A_2 \wedge Y_1 \in^{-} B_2) \vee (X_1 \in^{-} B_2 \wedge Y_1 \in^{-} A_2)) \wedge L_1 \in L_2$ 
                equal        $(X_1 \equiv X_2 \wedge Y_1 \equiv Y_2) \vee (X_1 \equiv Y_2 \wedge Y_1 \equiv X_2)$ 

3 Storing Uncertain Geometric Entities 

3.1 Necessary Conditions 

Now a data structure will be developed that allows all uncertain entities which 
match a given bilinear relation with a given uncertain entity to be found efficiently; 
e.g. given a line segment, one is able to find all those line segments that orthogonally 
intersect the given one, out of a large set of stored line segments. Therefore 
a necessary condition for bilinear tests like eq. (3) is derived first. The generic 
bilinear test has the form 

$$d^T \Sigma_{dd}^{+} d < T_{\alpha,n} \qquad (4)$$

with 

$$d = A(x) y \quad \text{and} \quad \Sigma_{dd} = A(x) \Sigma_{yy} A(x)^T + B(y) \Sigma_{xx} B(y)^T.$$

With $\sigma_x^2$ denoting the largest eigenvalue of $\Sigma_{xx}$, $\sigma_y^2$ denoting the largest 
eigenvalue of $\Sigma_{yy}$, and the rows of $A$ and $B$ denoted by $a_i$ and $b_i$, a necessary condition 
for eq. (4) is given by 

(a?y ) 2 ( a ^y ) 2 

a-2 afai + CT^bfbi a^a„ + o-2b^b„ 

Since all terms are positive, this can only hold, if 



Vi 



Vi 



(afy ) 2 



< T n 




) < 

) |y| N 



where the inequality (1) holds because the $b_i$ are projections of $y$ onto some 
subspace for every relation considered (cf. [5]). One can also assume that all $a_i$ 
and $y$ are spherically normalized, because the entities in oriented projective space 
are represented by homogeneous vectors. If one substitutes $\delta_x = \sqrt{T_{\alpha,n}} \, \sigma_x$ 
and $\delta_y = \sqrt{T_{\alpha,n}} \, \sigma_y$, a necessary condition for eq. (4) (cf. [2] for a proof) is 
given by 

$$|a_i^T y| < \begin{cases} \cos\left( \frac{\pi}{2} - \arccos \delta_x - \arccos \delta_y \right) & \text{if } \delta_x + \delta_y < 1 \\ 1 & \text{otherwise} \end{cases} \qquad (5)$$

This equation has a simple geometric interpretation: the hypothesis test of eq. 
(4) can only result in not rejecting the hypothesis if there is a vector $a'$ within 
the cone with axis $a_i$ and opening angle $\arccos \delta_x$ and another vector $y'$ within 
the cone with axis $y$ and opening angle $\arccos \delta_y$, so that the vectors $a'$ and $y'$ 
are perpendicular. 

Notice that reasoning along the same lines yields necessary conditions for 
the positive and negative orientation test (cf. eq. (2) and eq. (1)): 

$$\pm a_i^T y < \begin{cases} \cos\left( \frac{\pi}{2} - \arccos \delta_x - \arccos \delta_y \right) & \text{if } \delta_x + \delta_y < 1 \\ 1 & \text{otherwise} \end{cases} \qquad (6)$$

Thus, associating a key $(y, \delta_y)$ with each base entity $(y, \Sigma_{yy})$, one is able to 
check, using only this key together with eq. (5) or (6), whether a statistical hypothesis 
test with the associated entity might result in not rejecting the hypothesis. 
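
The key based pre-check of eq. (5) can be sketched as below. The function names are hypothetical; the vectors are assumed to be spherically normalized, as stated in the text, and the two delta values are assumed to have been computed beforehand from the covariance matrices.

```python
import math


def key_bound(delta_x, delta_y):
    """Right hand side of the necessary condition (5)."""
    if delta_x + delta_y < 1.0:
        return math.cos(math.pi / 2
                        - math.acos(delta_x) - math.acos(delta_y))
    return 1.0


def may_pass(a_i, y, delta_x, delta_y):
    """True if the full statistical test cannot be ruled out by the key."""
    dot = sum(p * q for p, q in zip(a_i, y))
    return abs(dot) < key_bound(delta_x, delta_y)


bound = key_bound(0.1, 0.2)
perpendicular = may_pass((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 0.01, 0.01)
parallel = may_pass((1.0, 0.0, 0.0), (1.0, 0.0, 0.0), 0.01, 0.01)
```

The check only needs one scalar product and the two key values, so it is much cheaper than the full test with covariance propagation. A `False` result means the entity can be skipped; a `True` result only means the full test still has to be run.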



3.2 Combination of Keys 

The next step is to combine two keys $(y_1, \delta_{y_1})$ and $(y_2, \delta_{y_2})$ into a new superkey 
$(y', \delta_{y'})$, such that the superkey yields a necessary condition for both of the 
keys. Since all keys represent hypercones, one looks for the enclosing hypercone 
to calculate the superkey. Note first that the axis of the superkey's hypercone 
must lie in the hyperplane spanned by the two axes $y_1$ and $y_2$; thus one can 
first calculate the intersection of the hypercone $(y_1, \delta_{y_1})$ with this hyperplane. 
Because it lies in the hyperplane, it can be parametrized by $y'_1 = (1 - \lambda) y_1 + \lambda y_2$, 
and inserting into the hypercone condition results in 

$$\frac{(y_1^T y'_1)^2}{{y'_1}^T y'_1} = 1 - \delta_{y_1}^2.$$

Solving this quadratic equation for $\lambda$ yields two solutions and thus two vectors $y'_{11}$ and 
$y'_{12}$. Doing the same for the hypercone $(y_2, \delta_{y_2})$ yields two more solutions $y'_{21}$ and 
$y'_{22}$. Two of those four lines must lie on the surface of the enclosing cone, namely 
those two with the greatest enclosing angle. To find those, one must first orient 
the lines to point into the same direction as the corresponding hypercone axis. 
This can simply be achieved by checking signs of scalar products. Together with 
a spherical normalization one obtains $y^*_{ij} = \operatorname{sign}({y'_{ij}}^T y_i) \, y'_{ij} / |y'_{ij}|$. Since the surface 
of the enclosing hypercone must lie on both sides of the axis $y_i$ in relation to $y_{3-i}$, 
one now determines which lines lie on which side: 

$$y_i^{n} = \begin{cases} y^*_{i1} & \text{if } y_{3-i}^T y^*_{i1} > y_{3-i}^T y^*_{i2} \\ y^*_{i2} & \text{otherwise} \end{cases} \qquad y_i^{f} = \begin{cases} y^*_{i1} & \text{if } y_{3-i}^T y^*_{i1} < y_{3-i}^T y^*_{i2} \\ y^*_{i2} & \text{otherwise} \end{cases}$$

Finally one is able to select those two oriented lines that include both 
hypercones, again by simply checking scalar products: 

$$m = \begin{cases} y_1^{f} & \text{if } y_2^T y_1^{f} < y_2^T y_2^{n} \\ y_2^{n} & \text{otherwise} \end{cases} \qquad n = \begin{cases} y_2^{f} & \text{if } y_1^T y_2^{f} < y_1^T y_1^{n} \\ y_1^{n} & \text{otherwise} \end{cases}$$

Thus the superkey is now given by: 

$$y' = \frac{m + n}{|m + n|}, \qquad \delta_{y'} = \begin{cases} \sqrt{1 - (m^T y')^2} & \text{if } y'^T y_1 > 0 \wedge y'^T y_2 > 0 \\ 1 & \text{otherwise} \end{cases}$$

By definition it has the property that whenever eq. (5) or (6) holds for any of 
$(y_1, \delta_{y_1})$ or $(y_2, \delta_{y_2})$, it must hold for $(y', \delta_{y'})$. Also notice that it can easily be 
generalized to more than two keys just by sequentially enlarging the hypercone. 
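
The idea of an enclosing hypercone can be illustrated with a simplified sketch. It is not the construction described above: instead of solving the quadratic equation for the boundary lines, it works directly with angles in the plane spanned by the two axes. It also assumes that the second key component can be read as the sine of the half opening angle of the cone (as the square root expression in the superkey definition suggests) and that both axes are unit vectors. The function name is hypothetical.

```python
import math


def merge_keys(y1, d1, y2, d2):
    """Smallest cone (axis, delta) enclosing the cones (y1, d1) and (y2, d2).

    Assumption: delta is the sine of the half opening angle; y1 and y2
    are unit vectors of equal dimension.
    """
    a1, a2 = math.asin(d1), math.asin(d2)      # half opening angles
    dot = max(-1.0, min(1.0, sum(p * q for p, q in zip(y1, y2))))
    phi = math.acos(dot)                       # angle between the axes
    if phi + a2 <= a1:
        return tuple(y1), d1                   # cone 1 already contains cone 2
    if phi + a1 <= a2:
        return tuple(y2), d2                   # cone 2 already contains cone 1
    half = 0.5 * (phi + a1 + a2)               # half angle of the enclosing cone
    if half >= math.pi / 2:
        return tuple(y1), 1.0                  # degenerate: key always passes
    # unit vector orthogonal to y1 inside the plane spanned by y1 and y2
    e = [q - dot * p for p, q in zip(y1, y2)]
    norm = math.sqrt(sum(c * c for c in e))
    t = half - a1                              # rotation of the axis towards y2
    axis = tuple(math.cos(t) * p + math.sin(t) * c / norm
                 for p, c in zip(y1, e))
    return axis, math.sin(half)


d = math.sin(math.radians(10.0))
axis, delta = merge_keys((1.0, 0.0, 0.0), d, (0.0, 1.0, 0.0), d)
nested = merge_keys((1.0, 0.0, 0.0), 0.5, (1.0, 0.0, 0.0), 0.1)
```

For two cones of 10° half angle around perpendicular axes, the result is a cone of 55° half angle around the bisecting direction. Merging more than two keys can be done by applying the function repeatedly, in line with the remark on sequential enlargement.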

3.3 The Data Structure 

Having defined those keys, one is now able to define an R-Tree like data structure 
(cf. [6]) that allows compound uncertain geometric entities of a single 
type to be stored, as follows: 

— every node of the tree contains at most 2M and at least M elements, unless 
it is the root 

— the elements of the leaf nodes are the compound uncertain geometric entities 
$((y_1, \Sigma_{y_1 y_1}), \ldots, (y_n, \Sigma_{y_n y_n}))$ together with a key $((y_1, \delta_{y_1}), \ldots, (y_n, \delta_{y_n}))$ as 
defined in section 3.1 

— every inner node's link is associated with a key $((y'_1, \delta_{y'_1}), \ldots, (y'_n, \delta_{y'_n}))$ 
constructed from the subnode's keys as described in section 3.2 

Two facts follow immediately from the definition of the tree: first, its height is 
bounded by $O(\log N)$ (cf. [1]), and second, a statistical test with an entity stored 
in a leaf node can only result in not rejecting the hypothesis if eq. (5) or (6) 
(depending on the test) holds for all keys along the path to the root. The second 
property is used to define the query algorithm for the data structure, by only 
descending into a subtree if the necessary condition with the key holds. Thus, 
the more complex the query, the better the performance of the algorithm, 
since more subtrees can be truncated at an earlier point in time. 
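
The query strategy can be sketched independently of the geometric details. In the following illustration the node layout, the function names and the toy keys are all assumptions: keys are simple numeric intervals standing in for the hypercone keys, `may_match` stands in for the necessary condition (5) or (6), and `exact_test` stands in for the full statistical test.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Node:
    key: Any                                   # key of a leaf element or superkey
    children: List["Node"] = field(default_factory=list)
    entity: Optional[Any] = None               # set only for leaf elements


def query(node: Node,
          may_match: Callable[[Any], bool],
          exact_test: Callable[[Any], bool],
          visited: Optional[list] = None) -> list:
    """Collect all entities passing exact_test, pruning with may_match."""
    if visited is not None:
        visited.append(node.key)               # bookkeeping for the example
    if not may_match(node.key):
        return []                              # whole subtree is skipped
    if node.entity is not None:
        return [node.entity] if exact_test(node.entity) else []
    hits = []
    for child in node.children:
        hits.extend(query(child, may_match, exact_test, visited))
    return hits


def leaf(value):
    return Node(key=(value, value), entity=value)


# Toy tree: interval keys (lo, hi) play the role of the hypercone keys.
root = Node((1, 9), [Node((1, 2), [leaf(1), leaf(2)]),
                     Node((8, 9), [leaf(8), leaf(9)])])

target = 8
visited: list = []
hits = query(root,
             may_match=lambda k: k[0] <= target <= k[1],
             exact_test=lambda e: e == target,
             visited=visited)
```

In the toy tree the subtree with key (1, 2) is rejected at its root, so its two leaves are never touched. The same control flow applies when the keys are hypercones and the pruning predicate is the necessary condition from section 3.1.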

To insert an element into the tree and maintain the first property, a strategy 
similar to the construction of an R-Tree is used. On every level the algorithm 
computes for every subtree the enlargement of the opening angles of the keys' 
hypercones and inserts the entity into the subtree where the enlargements are 
minimal. If a node has more than 2M elements, the node is split into two subsets 
of size M and M + 1, such that the opening angles of the superkeys' hypercones 
of the elements of each subset are minimal. To find those subsets, the quadratic 
split heuristic proposed in [6] is used. As shown in [1], the running time of this 
algorithm is bounded by $O(\log N)$. 

A more detailed description together with some implementation details can 
be found in [2]. Note also that the data structure is not limited to line segments, 
but can store and perform any kind of statistical test on data that is constructed 
from multiple (or single) uncertain base entities of the Grassmann-Cayley 
algebra. For points it is similar to the classical R-Tree, thus a similar performance 
can be expected in this case. 

4 Experimental Evaluation 

The running times of the data structure proposed in the previous section 
were evaluated on artificial line segment data in 3D-space. A set of N line 
segments inside the cube of volume 1 centered at the origin was generated 
randomly. The line segments were of random length between 0.05 and 0.1 and of 
random orientation, with the standard deviation of the end-points being 0.001. All 
N line segments were inserted into the proposed data structure and another 
random line segment was used to retrieve all line segments from the data 
structure that intersect the given one. Intersection was chosen because it has 
the broadest field of application, though the other relations behave very 
similarly. An application example for this kind of query that 
benefits from the proposed data structure can be found in [3]. The running times 
on a current standard desktop computer for different values of the node's half 
size M are depicted at the bottom of figure 1. 

Since classical multi-dimensional data structures do not support statistical 
geometric tests as queries, the running time for sequentially comparing all N line 
segments with the given one is shown in the middle of figure 1. It can be seen 
that the improvement is up to a factor of 50, depending on the number of line 
segments stored in the data structure. 

The drawback of using an index structure is that the construction requires 
time. The construction times for different values of M are depicted at the top of 
figure 1. It can be seen that the choice of M = 2 is best, since the construction 
time heavily depends on M whereas the query times do not depend on M so 
strongly. Certainly a large number of queries, for example as required by a spatial 
join algorithm, to a fairly large and static set of line segments is required to 
exploit the benefits of the proposed data structure. 

5 Conclusion 

In this work a framework was presented that allows statistical tests to be performed 
on uncertain geometric entities constructed from tuples of uncertain base entities 
in oriented projective space. It was shown that uncertain line segments in 2D- 
and 3D-space are constructible in such a way that statistical reasoning about 
these practically very important geometric entities is possible in this framework. 

Fig. 1. Running times for construction and intersection 
queries of 3D line segments 

The second contribution of this work is the introduction of a data structure 
that allows this kind of test to be performed in an efficient manner. The special 
structure of statistical testing was used in the design of the data structure, such 
that it is capable of performing complex statistical reasoning tasks in an efficient 
manner. It therefore outperforms classical multi-dimensional data structures, 
since they are not able to handle this kind of query. 

Since the amount of measured, i.e. uncertain, geometric data in many computer
vision tasks is extremely high, the need for efficient geometric reasoning
algorithms is evident. The experiments showed that the gain in performance is
very high if large amounts of data are to be processed, so that the application
of the presented framework and data structure could lead to new, more feasible
algorithms in the analysis of large aerial images or large image sequences, where
known statistical properties of the measured data can be used.
Abstract. We present a new measure for evaluation of algorithms for the detection
of regions of interest (ROI) in, e.g., attention mechanisms. In contrast to existing 
measures, the present approach handles situations of order uncertainties, where 
the order for some ROIs is crucial, while for others it is not. We compare the results 
of several measures in some theoretical cases as well as some real applications. 
We further demonstrate how our measure can be used to evaluate algorithms for 
ROI detection, particularly the model of Itti and Koch for bottom-up data-driven 
attention. 



1 Introduction 

During the last decade, several studies were conducted concerning the mechanism of
attention in human and animal vision. It has been observed that biological vision is
based on the dynamic selection of regions of interest (ROI) through the guidance of the
gaze towards selected scenic regions. In other words, regions in the current field of view
are selected in order to focus high-resolution processing resources on locations that are
likely to contain important information for the task currently performed. Multiple ROIs are
then scanned in a serial manner by fixating the high-resolution fovea of the eye on these
locations using saccadic movements.

Following these studies on biological vision, several computational models for adapting
this principle of "regions of interest" have been proposed [3] [5] [9] [2]. Although their
internal working schemes often differ significantly, they all result in a set of locations for
different regions of interest. The key mechanism is to select image locations of high
information content, often measured in terms of feature contrast.

Stark and Choi [8] investigated the sequence of selected fixations. They found that
the path of scanned locations repeats itself after several fixations. The authors termed
the underlying mechanism of the mind’s eye the "scanpath theory" of attentive vision.
However, the authors also showed that there is no such thing as a global scanpath,
that is, not everybody looks at the same spots in the same order. Even the scanpath of
the same observer for the same stimulus is not unique. Thus it is not possible to directly
compare the results of different models, because even if both models predict the same
ROIs, their order is very likely to differ. Similar problems arise when trying to evaluate
a single model under various conditions, such as different noise or illumination.

Existing measures for the evaluation of ROI-based attention algorithms have difficulties
representing these variations in the order of regions of interest. We present a new
measure which is capable of handling such uncertainties. This enables us to evaluate
these algorithms and to compare them against human or animal observers. Using this
measure we will evaluate the attention model of Itti and Koch [3]. For that reason we
give a short review of the model in section 2. The measure itself is developed in
section 3 and its results are then shown in section 4.

2 Review of Itti and Koch’s Approach for Attention 

Itti, Koch and Niebur [3] presented a popular model for computing regions of interest
using a saliency map which is obtained from a pyramidal multi-resolution representation
of the input data. The computation is purely data-driven, as it does not incorporate
any feedback or knowledge-based mechanisms.

First, a Gaussian pyramid is built from the input stimulus. This pyramid of different
scales of the image is then split into several channels selective to different features
such as color, intensity or orientation. Following this step, a center-surround operation
is performed on each of these multi-scale representations.

All these maps are then combined into a single saliency map. Different strategies for
combining these maps have been discussed and analysed in [4]. The simplest method is just
summing up all the maps. In order to achieve this, the maps have to be scaled to the same
spatial size prior to summation. The more advanced methods we applied perform iterative
center-surround inhibition to sharpen the data and extract local maxima [4].

On the resulting saliency map, a winner-take-all algorithm is applied which determines
the most salient location, which is attended next. To prevent this location from being
attended again in the following steps, it is inhibited within the saliency map for a
certain number of iterations (inhibition of return). From a biological view, it is not
yet entirely clear whether such a global saliency map really exists in the human brain.
Recent experimental findings suggest that cortical area V4 might be a candidate [6].
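
The selection loop described above can be illustrated with a minimal sketch. It is not the implementation of [3]: the saliency map is assumed to be a plain 2D list, and the function name, the square inhibition neighbourhood (`radius`) and the inhibition duration (`inhibit_steps`) are hypothetical choices made only for illustration.

```python
def scan_saliency_map(saliency, n_fixations, radius=1, inhibit_steps=2):
    """Return attended (row, col) locations in order of selection.

    saliency: 2D list of non-negative numbers (assumed precomputed).
    radius / inhibit_steps: hypothetical inhibition-of-return parameters.
    """
    rows, cols = len(saliency), len(saliency[0])
    # blocked[r][c] > 0 means the location is still inhibited
    blocked = [[0] * cols for _ in range(rows)]
    fixations = []
    for _ in range(n_fixations):
        # Winner-take-all: most salient location that is not inhibited
        best = None
        for r in range(rows):
            for c in range(cols):
                if blocked[r][c] > 0:
                    continue
                if best is None or saliency[r][c] > saliency[best[0]][best[1]]:
                    best = (r, c)
        if best is None:  # everything is inhibited
            break
        fixations.append(best)
        # Let earlier inhibition decay by one iteration
        for r in range(rows):
            for c in range(cols):
                if blocked[r][c] > 0:
                    blocked[r][c] -= 1
        # Inhibition of return around the current winner
        for r in range(max(0, best[0] - radius), min(rows, best[0] + radius + 1)):
            for c in range(max(0, best[1] - radius), min(cols, best[1] + radius + 1)):
                blocked[r][c] = inhibit_steps
    return fixations
```

With this sketch a winner stays inhibited for exactly `inhibit_steps` subsequent iterations and may be selected again afterwards; the original model may handle the decay differently.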

3 Methods of Evaluation 

In this section we will present two simple measures for evaluating models that select 
regions of interest. We will further explain how our new measure is calculated. 

3.1 Order-Independent Measure 

The measure presented in this section is the simplest method to evaluate an ROI model.
Assume that we have a ground truth consisting of n ROIs. It is then checked how many
of these ground-truth ROIs are found again within the first n results during the
test run of the model. Example: If the ground truth has 5 ROIs and 3 of them are also
found during the test run, the result is simply 0.6 or 60%. This measure is easy to
implement and can be calculated very quickly. Its major limitation is that it takes
no account at all of the order in which the ROIs were chosen.

Fig. 1. Attention model of Itti and Koch [3]
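
As a minimal sketch of this measure, assuming that ROIs are represented by unique hashable labels and that the ground truth is non-empty (the function name is hypothetical):

```python
def order_independent_measure(ground_truth, detected):
    """Fraction of ground-truth ROIs found among the first n detections,
    where n is the number of ground-truth ROIs. Order is ignored."""
    n = len(ground_truth)
    first_n = set(detected[:n])
    hits = sum(1 for roi in set(ground_truth) if roi in first_n)
    return hits / n
```

For a ground truth of five ROIs of which three appear among the first five detections, the function returns 3/5, matching the example in the text.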

3.2 Order-Dependent Measure (String Edit Distance) 

Another measure to evaluate models is the "string edit distance" [1]. Each ROI from the
ground truth is labelled with a separate letter and these letters are concatenated in the
order of appearance of the corresponding ROI to form a "ground-truth string". After
running the algorithm that is evaluated, a second string is built from the ROIs selected
by the algorithm.

The two strings are then compared by calculating how "costly" it is to transform
one string into the other. Costs are defined for insertion,
deletion and substitution of letters. The minimum cost of this transformation is usually
computed using dynamic programming.
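
The dynamic programming computation can be sketched as follows. Unit costs for insertion, deletion and substitution are an assumption; the text does not state which cost values were used.

```python
def string_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum total cost of turning `source` into `target`."""
    m, n = len(source), len(target)
    # dist[i][j] = minimum cost of turning source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * del_cost
    for j in range(1, n + 1):
        dist[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                      # delete
                dist[i][j - 1] + ins_cost,                      # insert
                dist[i - 1][j - 1] + (0 if same else sub_cost)  # keep or substitute
            )
    return dist[m][n]
```

With unit costs, the ground-truth string "123" and the detected string "132" have distance 2.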

One limitation of this measure is that it is not possible to define two regions
of interest of equal importance. Instead, one ROI always has to be preferred over another
when setting up the ground truth and its labelling order.

3.3 Hybrid Measure 

In order to circumvent the limitations of the two measures presented above, we develop 
a new measure which considers the order of the regions of interest and also accounts for 
systematic variations. 

One calculation run of our proposed measure consists of several steps. 

1. First the ROIs need to be determined. This may be done by any kind of source:
human or animal observers, a single person, or a computational model.

2. Just as for the calculation of the minimum string edit distance, the ROIs need to
be assigned numbers.

3. The relative order of the ROIs is then stored in a matrix for further processing as
shown in Fig. 2.

4. All the previous steps are performed multiple times, and the resulting matrices are
summed up.

5. Finally the resulting matrix is normalized by dividing it by the number of
iterations of the previous steps.

6. The obtained matrix encodes, for every pair (a,b), the probability that ROI a
precedes ROI b.

First, the calculation presented above has to be done once for a ground-truth run,
resulting in a matrix A. This matrix describes the relative order of ROIs for the ground
truth. After that, the calculation is performed for one or more test runs and a
corresponding matrix B is returned. The two matrices are then compared by calculating
their normalized cross-correlation:

c = \frac{\sum_{i,j} A_{ij} B_{ij}}{\|A\| \, \|B\|}

The measure presented in this section is capable of handling both a strict and a
loose order of the ROIs.
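
A minimal sketch of the hybrid measure is given below. It assumes that ROIs are labelled 1..n, that index 0 stands for the artificial "start" predecessor, and that the normalized cross-correlation is the element-wise inner product divided by the Frobenius norms of both matrices. The extra row and column for ROIs that are not part of the ground truth are omitted, and all names are hypothetical.

```python
import math


def pre_occurrence_matrix(runs, n_rois):
    """Average predecessor matrix over several runs.

    runs: list of ROI label sequences, labels in 1..n_rois.
    Entry [current][previous] is the relative frequency with which
    `previous` directly preceded `current` (previous = 0 means start).
    """
    size = n_rois + 1
    matrix = [[0.0] * size for _ in range(size)]
    for run in runs:
        previous = 0
        for roi in run:
            matrix[roi][previous] += 1.0
            previous = roi
    return [[value / len(runs) for value in row] for row in matrix]


def hybrid_measure(a, b):
    """Normalized cross-correlation of two equally sized matrices."""
    dot = sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    norm_a = math.sqrt(sum(x * x for row in a for x in row))
    norm_b = math.sqrt(sum(y * y for row in b for y in row))
    return dot / (norm_a * norm_b)
```

For example, a ground truth built from the single run `[1, 2, 3]` compared with a test run `[1, 3, 2]` shares only the entry "start precedes 1", which gives a value of 1/3 under these assumptions.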
Fig. 2. The matrix created from the relative order of ROIs. The ordinate of the matrix denotes the
label of the current ROI, the abscissa denotes the label of the ROI that preceded the current one.
As the very first ROI detected does not have any predecessor, an additional entry on the abscissa is
needed. Each value in this "pre-occurrence matrix" represents the probability that ROI a precedes
ROI b. It can be observed that the sum of each column and row respectively has to be one. When
a strict order of ROIs can be defined, the matrix obtained simply consists of all ones on the
second diagonal. If there are several ROIs that should have the same importance, probabilities
are simply spread over several possible predecessors. In the example, the order of the ROIs is
(0 →) 1 → 3 → 2. In order to account for regions of interest that were not present in the ground
truth, an extra row and column exist, which sum up those additional regions of interest. For the
ground truth those cells are simply all zero.



4 Results 

In this section we evaluate our measure by analysing its behaviour in some theoretical
scenarios. Afterwards we will compare our measure against string edit distance regarding
the impact of noise on the model of Itti and Koch [3]. In order to be able to do this we
need to normalize all three measures to the range [0,1]. This is done in the following way:

• order-independent measure: the values returned by this measure are already within the
range [0,1].

• string edit distance: m = 1 - distance/n, with n denoting the number of ROIs in the
ground truth.

• hybrid measure: as all elements of the histogram matrices are non-negative, the results
already lie within the range [0,1].

Furthermore, we will also evaluate two of the feature combination strategies pre- 
sented by Itti and Koch [4] using our newly presented measure. 

4.1 Theoretical Cases 

First, we compare the three measures presented so far in two theoretical scenarios. 

Scenario A: Suppose we have three regions of interest, labeled 1 to 3, and let their
correct order be 1 → 2 → 3. We now assume the ROI algorithm detects them in the
order 1 → 3 → 2.
For a suitable measure it is expected that the value falls below 1, as the order of detection
is incorrect. The results of the three measures presented in section 3 are shown in table
1. As we can see, the order-independent measure does not account for the wrong order
of detection. Both string edit distance and our statistical measure correctly
decrease to a lower value.

Scenario B: We assume three ROIs as before, but the ones labeled 2 and 3 are
equally important. This means that in 50% of the cases the detected order is 1 → 2 → 3
and 1 → 3 → 2 otherwise. As the order of ROI 2 and 3 should not matter, the measure
should not decrease but stay at the maximum value of 1. Again, the results of the three
measures are shown in table 1. Since the order-independent measure does not care about
ROI sequence, its result still is 1, which is correct in this case. Also, our hybrid measure
correctly returns 1, as the order of ROI 2 and 3 should not matter. String edit distance,
however, cannot cope with such a situation and falls to 2/3.

Table 1. Results obtained from the three measures for our theoretical scenarios. 



scenario   expected   order-independent   order-dependent   hybrid
A          < 1        1                   0.33              0.33
B          1          1                   0.67              1



This demonstrates that our new hybrid measure can deal with both theoretical
situations, strict order as well as partially ambiguous order.
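
The entries of Table 1 can be reproduced by hand under a few assumptions that the text does not state explicitly: unit edit costs (so that "123" and "132" have edit distance 2), a string edit score for scenario B that is averaged over the two equally likely detection orders, and a cross-correlation c computed as the element-wise inner product divided by the Frobenius norms of the two matrices.

```latex
\begin{align*}
\text{Scenario A:}\quad
  m_{\text{indep}} &= \tfrac{3}{3} = 1, &
  m_{\text{edit}} &= 1 - \tfrac{2}{3} \approx 0.33, &
  c &= \frac{1}{\sqrt{3}\,\sqrt{3}} \approx 0.33,\\
\text{Scenario B:}\quad
  m_{\text{indep}} &= \tfrac{3}{3} = 1, &
  m_{\text{edit}} &= \tfrac{1}{2}\Big(1 - \tfrac{0}{3}\Big) + \tfrac{1}{2}\Big(1 - \tfrac{2}{3}\Big) \approx 0.67, &
  c &= \frac{2}{\sqrt{2}\,\sqrt{2}} = 1 .
\end{align*}
```

In scenario A the two matrices share only the entry "ROI 1 follows the start", hence the numerator 1. In scenario B the ground-truth and test matrices are identical (one entry of 1 and four entries of 0.5), so the inner product and the squared norm are both 2.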

4.2 Evaluation of the Model of Itti and Koch 

We now take a look at how the newly proposed measure performs on real applications.
We therefore evaluate the impact of noise on the model of Itti and Koch using
two different feature combination strategies: simple summation versus iterative local
inhibition [4]. The input stimulus shown in Figure 3 has been motivated by recent findings
about saliency and pop-out effects [7]. The evidence suggests that feature contrasts, rather
than absolute values, are the relevant measures that lead to target selection and boundary
detection. Here the stimulus is composed of disks of equal size and one smaller disk, thus
leading to a contrast in circle diameter. As the image is symmetric, all five big circles
should result in the same conspicuity. That is, their order of detection should not matter.
In contrast, the small sixth circle should result in a higher saliency as it generates a
strong contrast against the other disks. It is the only peak present in the corresponding
spatial scale, so it ought to be the first ROI detected by the algorithm. In our test scenario
we have applied white Gaussian noise to the input stimulus, ranging from sigma = 0
up to sigma = 2 in steps of 0.2. We then applied our implementation of the
model of Itti and Koch to these input stimuli. We did so using two feature combination
strategies, simple summation as well as iterative local inhibition. In [4] it was shown
that iterative local inhibition performs better than simple summation; it is therefore
expected that the proposed statistical measure quantifies this result. Figure 3 shows
the corresponding results. We obtained the expected result, showing that local iterative
inhibition outperforms simple summation.
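
The noise sweep described above could be set up roughly as follows. This is only a sketch: the image is assumed to be a 2D list of intensities, the noise is added independently per pixel, and the function names are hypothetical. How intensities are scaled relative to sigma is not stated in the text and is left open here.

```python
import random


def noise_levels(start=0.0, stop=2.0, step=0.2):
    """Noise standard deviations from start to stop (inclusive)."""
    count = int(round((stop - start) / step)) + 1
    # round() removes floating-point drift such as 0.6000000000000001
    return [round(start + k * step, 10) for k in range(count)]


def add_gaussian_noise(image, sigma, rng):
    """Return a copy of the image with white Gaussian noise added."""
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]
```

A test series then consists of one noisy stimulus per level, for example `[add_gaussian_noise(img, s, random.Random(0)) for s in noise_levels()]`, each of which is passed to the attention model and scored against the ground truth.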
Fig. 3. Left: Input stimulus. Six radially aligned discs with equal intensity. Five of them have the
same diameter, the sixth one is significantly smaller. Right: Simple summation (squares) versus 
local iterative inhibition (stars). It can be seen that the latter performs better than simple summation 
under the influence of noise. On the abscissa, sigma denotes the amount of noise applied to the input 
stimulus. "Measure value" on the ordinate denotes the value returned by our proposed measure. 



We will now compare our statistical measure against string edit distance and the
simple measure. The feature combination strategy used for the model of Itti and Koch
is the local iterative inhibition. First, we select a scenario which all of the measures 
should be able to handle. The corresponding input stimulus is shown in Figure 4. It 
consists of six radially aligned circles with decreasing intensity and equal diameter. As 
the brightness of the circles has an order, their respective saliency is ordered as well. 
Therefore, there is a strict order of preference of the regions of interest. In addition, 
when the noise level is increased, the number of disks that can be detected decreases 
monotonically. Accordingly, the simple measure should also decrease. 




Fig. 4. Left: Input stimulus. Right: Performance of the three measures presented in section 3, string 
edit distance (squares), simple measure (triangles) and our proposed measure (stars). All measures 
remain almost constant up to around sigma = 0.8, from where on they fall linearly to a value of 
about 0.5 for sigma = 2. The simple measure rates the algorithm slightly better, as it only reacts
to lost ROIs, in contrast to the other two measures. Again, sigma denotes the amount of noise
applied to the input stimulus. The "measure value" on the ordinate denotes the values returned by 
the three different measures. 
Fig. 5. Left: Input stimulus. Right: As string edit distance (squares) cannot cope with equally
important ROIs, its value drops too fast even for small values of sigma. In contrast, the simple
measure (triangles) overestimates the performance of the algorithm, as most of the ROIs can be
detected even under strong noise, but in the wrong order. As in figure 4, "sigma" denotes the
amount of noise applied, "measure value" denotes the values returned by the three different measures.




Fig. 6. Left: Input stimulus, a real scene. Right: Sigma on the abscissa denotes the amount of 
noise added to the image. String edit distance (squares) decreases very fast, the simple measure 
(triangles) falls very slowly. Our proposed measure lies between them. 

We can see in Figure 4 that the expected results are returned. All measures behave
similarly, as there is an unambiguous order of ROIs and weak stimuli get lost with an
increasing amount of noise.

In the next step we calculate all three measures for our first input stimulus shown in
Figure 3. The results are shown in Figure 5. Again, we perform local iterative inhibition
prior to feature map combination. We expect that the string edit distance measure is not
able to interpret the output of the algorithm correctly and therefore degrades too fast. In
contrast, the simple measure is expected to degrade too slowly, as all features have a high
intensity contrast and therefore can be detected even in noisy situations.

As expected, string edit distance rates the performance of the algorithm worse and
drops quickly. In contrast, the simple measure, which does not handle variations in the
order of ROIs, decreases only very slowly. The result of our proposed measure lies
between these two other measures.

Finally we evaluate the model of Itti and Koch on a real-life image, shown in Figure
6. As most of the windows on the fort are similar, we expect that there are several ROIs
of ambiguous order. For this reason string edit distance decays too fast.

5 Discussion

We have proposed a new measure based on relative order for evaluating algorithms that
detect ROIs, such as those for selective attention. This new measure has been shown to be
able to evaluate a single ROI detection algorithm under varying conditions. In addition,
it is also able to compare different algorithms against each other. In contrast to string
edit distance it can handle any arbitrary order of ROIs. In the current proposal, all ROIs
have the same impact on the measure, which might not be sufficient for some situations. 
One possible extension of the proposed scheme, therefore, could be the possibility to put 
more weight on some ROIs than on others. In summary, the proposed measure is able 
to rate performance of algorithms where the order of the ROIs may be ambiguous. This 
is necessary when trying to compare a computational model with biological studies on 
human or animal observers. 
Abstract. Identifying the action potentials of individual neurons from
extracellular recordings, known as spike sorting, is a challenging problem. 
We consider the spike sorting problem using a generative model, mixtures 
of factor analysers, which concurrently performs clustering and feature 
extraction. The most important advantage of this method is that it quan- 
tifies the certainty with which the spikes are classified. This can be used 
as a means for evaluating the quality of clustering and therefore spike 
isolation. Using this method, nearly simultaneously occurring spikes can 
also be modelled which is a hard task for many of the spike sorting meth- 
ods. Furthermore, modelling the data with a generative model allows us 
to generate simulated data. 



1 Introduction 

Recording the spiking activity from well isolated single neurons is important for 
studying the physiological functions of the brain. Although intracellular elec- 
trodes provide good quality signals from a single neuron, recording with an 
intracellular electrode in awake behaving animals is extremely difficult. Extra- 
cellular electrodes introduced into the brain isolating a single neuron have been 
successfully used for years. Recently, there has been excitement about recording 
simultaneously from multiple neurons in order to study their interactions. Elec- 
trodes placed in the extracellular medium can record the activity of multiple 
nearby neurons but this leads us to the question of distinguishing between the 
activity of individual neurons known as spike sorting. Under the assumption that 
the extracellular space is electrically homogeneous, four-tip electrodes (tetrodes) 
provide the minimal number necessary to identify the spatial position of a source 
based on the relative spike amplitudes on different electrodes. Recording with 
multi-tip electrodes improves the identification of individual neurons compared 
to standard single-tip electrodes ([3]). 

Spike sorting is usually done in three steps, namely spike detection, feature 
extraction and clustering. Determination of the occurrence of spikes, which is 
usually achieved by high-pass filtering followed by thresholding is known as the 
spike detection step. In the feature extraction stage, a feature vector for each 
spike is calculated and clustering is done on this low dimensional feature space. 
Fig. 1. Data recording and representation: Extracellular waveforms are recorded with
a tetrode. Every time the signal exceeded the threshold in one of the channels, a window
of 1 ms around this event was extracted from the recordings of each channel and joined
to form the data vectors.

In most laboratories the clustering is done manually, usually using only the peak- 
to-peak amplitude data as features in order to make visualisation possible. There 
are also automatic spike sorting techniques that have been proposed which use 
different methods for feature extraction and clustering (see [5] for a review). A 
common problem of the spike sorting techniques is that the true labels for the 
recorded data cannot be known without the verification of intracellular record- 
ings, which makes it very hard to evaluate the results obtained by clustering 
techniques. 

Here, we present an automated way of spike sorting based on mixtures of 
factor analysers (MFA) using data collected with tetrodes. MFA is a statistical 
method that concurrently performs clustering and feature extraction. It models 
the spike waveforms and therefore overcomes the major cause of separation error, 
overlapping clusters in the low dimensional feature space. MFA can also model 
the nearly simultaneously occurring spikes which is a hard task for many spike 
sorting methods. It assigns responsibility degrees to each cluster for each spike 
and the entropy of these responsibilities can be used as a means for clustering 
evaluation. In addition, modelling the data with a generative model allows us to 
generate simulated data. The next section describes the data used in this study. 
In section 3, the clustering method is explained and in section 4 we demonstrate 
the method with real data collected with tetrodes. 



2 Data Collection 

We used data recorded with tetrodes from awake behaving macaque monkeys.
The data were collected using a multi-channel data acquisition system (Cheetah
Inc.). The signal was band-pass filtered between 600 and 6000 Hz and digitised at
32 kHz. The neuronal spikes were acquired via an interrupt-driven, spike voltage
threshold-triggered data acquisition system. That is, when a signal above
threshold was detected in one of the channels, the occurrence of a spike was assumed
and therefore the signal in all channels was stored within a window
of 1 ms around the triggering event with a corresponding time stamp. The
32-dimensional signals of all four channels were joined to form the 128-dimensional
data vectors for the MFA model (see figure 1).
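
The construction of the 128-dimensional data vectors can be sketched as follows. This is an offline illustration, not the acquisition system's actual logic: the channels are assumed to be equal-length lists of samples, and the number of samples kept before the trigger (`pre`) is a hypothetical parameter, since the text only says the window lies "around" the triggering event. At 32 kHz a 1 ms window corresponds to 32 samples per channel.

```python
def extract_spike_vectors(channels, threshold, window=32, pre=8):
    """Cut threshold-triggered windows and join them across channels.

    channels: list of equal-length sample lists (four for a tetrode).
    Returns one vector of len(channels) * window samples per detected event.
    """
    n_samples = len(channels[0])
    vectors = []
    t = pre
    while t + (window - pre) <= n_samples:
        # Trigger when any channel exceeds the threshold at time t
        if any(channel[t] > threshold for channel in channels):
            start = t - pre
            vector = []
            for channel in channels:
                vector.extend(channel[start:start + window])
            vectors.append(vector)
            t = start + window  # continue after this window (no overlap handling)
        else:
            t += 1
    return vectors
```

With four channels and `window=32`, every returned vector has 4 x 32 = 128 entries, matching the dimensionality stated in the text.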

3 Methods 

In unsupervised learning, dimensionality reduction and clustering are usually two 
steps that come sequentially. In dimensionality reduction, the features that are 
highly correlated are compressed, whereas in clustering, the data with similar 
features are grouped. MFA combines a well known dimensionality reduction 
technique, factor analysis, with a widely used clustering technique, mixture of 
Gaussians. 

3.1 Factor Analysis 

Factor analysis (FA) is a latent variable model in which a $p$-dimensional real-valued
data vector $y$ is modelled using a $k$-dimensional vector of factors $x$, where
$k \ll p$. Dimensionality reduction is achieved by finding a low dimensional
projection of the high dimensional data that captures most of the correlation
structure of the data. The generative model is given by:

y = \Lambda x + \mu + \epsilon    (1)

where $\Lambda$ is called the factor loading matrix. The factors $x$ and the noise $\epsilon$ are
assumed to be normally distributed, $x \sim \mathcal{N}(0, I)$ and $\epsilon \sim \mathcal{N}(0, \Psi)$, where $\Psi$ is
a diagonal matrix. Therefore, $y$ is also normally distributed with mean $\mu$ and
covariance $\Lambda\Lambda^{T} + \Psi$. If $\Psi$ is constrained to be $\delta I$, then in the limit $\delta \to 0$
FA becomes PCA. In FA, the scaling of the coordinates is not important, but
the rotation of the axes in which the original data are represented is important, since the
noise is assumed to be independent along these axes [7].
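
The stated mean and covariance follow directly from the generative model, assuming (as is standard in factor analysis) that the factors and the noise are independent of each other:

```latex
\begin{align*}
\mathbb{E}[y] &= \Lambda\,\mathbb{E}[x] + \mu + \mathbb{E}[\epsilon] = \mu,\\
\operatorname{Cov}[y]
  &= \mathbb{E}\big[(\Lambda x + \epsilon)(\Lambda x + \epsilon)^{T}\big]
   = \Lambda\,\mathbb{E}[x x^{T}]\,\Lambda^{T} + \mathbb{E}[\epsilon\epsilon^{T}]
   = \Lambda\Lambda^{T} + \Psi .
\end{align*}
```

The cross terms $\Lambda\,\mathbb{E}[x\epsilon^{T}]$ and its transpose vanish because of the independence assumption and the zero means of $x$ and $\epsilon$.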

3.2 Mixtures of Factor Analysers 

By using a mixture of factor analysers, dimensionality reduction and clustering 
can be achieved simultaneously. If we consider a mixture of M factor analysers, 
the distribution of the data becomes: 

P(y) = \sum_{i=1}^{M} \pi_i \, \mathcal{N}(\mu_i, \Lambda_i \Lambda_i^{T} + \Psi)    (2)

where $\pi_i$ denote the mixing proportions.
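
Since the mixture is a generative model, simulated data can be drawn from it by ancestral sampling: pick a component according to the mixing proportions, draw the factors, and add diagonal noise. The sketch below assumes a noise covariance shared by all components and represents matrices as nested lists; the function and argument names are hypothetical.

```python
import random


def sample_mfa(pis, mus, lambdas, psi_diag, rng):
    """Draw one sample from a mixture of factor analysers.

    pis:      mixing proportions, one per component
    mus:      list of mean vectors (length p each)
    lambdas:  list of factor loading matrices (p rows, k columns each)
    psi_diag: diagonal of the noise covariance (length p), assumed shared
    Returns (component index, sampled data vector).
    """
    i = rng.choices(range(len(pis)), weights=pis, k=1)[0]
    loading = lambdas[i]
    k = len(loading[0])
    x = [rng.gauss(0.0, 1.0) for _ in range(k)]  # factors x ~ N(0, I)
    y = []
    for row, mean_d, psi_d in zip(loading, mus[i], psi_diag):
        noise = rng.gauss(0.0, psi_d ** 0.5)     # noise ~ N(0, psi_d)
        y.append(sum(l * xv for l, xv in zip(row, x)) + mean_d + noise)
    return i, y
```

Calling the function repeatedly with a seeded `random.Random` instance yields a reproducible simulated data set.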

Given a set of observations $Y = \{y_1, \ldots, y_N\}$, we can describe the joint
distribution of the data and the hidden factors using binary indicator variables
$z_i$, $i = 1, \ldots, M$, as

P(Y, x, z) = \prod_{j=1}^{N} \sum_{i=1}^{M} P(z_i) \, P(x | z_i) \, P(y_j | x, z_i)    (3)

where $z_i = 1$ indicates that the example was generated by the $i$-th mixture. The
unknown parameters $\Lambda_i$, $\mu_i$, $\pi_i$ and $\Psi$ that best model the covariance structure
of the data can be found by maximum likelihood estimation using the EM algorithm,
described below.

In the following, $Y$ is the observation matrix and $H$ are the unobserved
(missing) quantities, in our case the latent variables of the factor analysis model
($x$) and the identity of the mixture component which generated the observation
($z_i$). For any distribution $Q$ over the latent variables, the log likelihood
($\mathcal{L} = \ln P(Y|\theta)$) can be lower bounded using Jensen’s inequality:

\mathcal{L} = \ln \int dH \, P(H, Y | \theta) \ge \int dH \, Q(H) \ln \frac{P(H, Y | \theta)}{Q(H)} = \mathcal{F}(Q, \theta),    (4)

defining the $\mathcal{F}(Q, \theta)$ functional. Alternately optimising $\mathcal{F}$ with respect to the
distribution of the hidden variables $Q(H)$ and the parameters $\theta$ is guaranteed
not to decrease $\mathcal{L}$.
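
The tightness of this bound can be made explicit by a standard decomposition, obtained by factoring the joint distribution as $P(H, Y|\theta) = P(H|Y, \theta)\,P(Y|\theta)$:

```latex
\begin{align*}
\mathcal{F}(Q, \theta)
  &= \int dH \, Q(H) \ln \frac{P(H \mid Y, \theta)\, P(Y \mid \theta)}{Q(H)} \\
  &= \ln P(Y \mid \theta) - \int dH \, Q(H) \ln \frac{Q(H)}{P(H \mid Y, \theta)} \\
  &= \mathcal{L} - \mathrm{KL}\big(Q(H) \,\|\, P(H \mid Y, \theta)\big) \;\le\; \mathcal{L} .
\end{align*}
```

Because the KL-divergence is non-negative and vanishes only when $Q(H) = P(H|Y, \theta)$, the bound is tight exactly when $Q$ equals the posterior over the hidden variables.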

In the E step, we specify the distribution of the hidden variables that maximises
$\mathcal{F}$¹ and calculate the expectation of the log likelihood with respect to
this distribution. In the M step, we maximise the expected log likelihood,
$\log P(H, Y|\theta)$, with respect to the parameters, keeping the $Q$ distribution constant.

There are two sets of conditionally independent hidden variables, thus the
$Q$ distribution is of the form $Q(x, z_i) = Q(x|z_i) Q(z_i)$. Therefore, in the E step
we compute the distributions of the hidden factors given the indicator variables,
$Q(x|z_i)$, the distribution of the indicator variables, $Q(z_i)$, and the expected log
likelihood with respect to the $Q$ distribution (for details see [2]).

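In the M step, the update rule for the parameters is obtained simply by setting
the derivative of the expected log likelihood with respect to the parameters
to zero and solving for the parameters²:

\tilde{\Lambda}_i = [\Lambda_i \;\; \mu_i] = \Big( \sum_j y_j \langle z_i \tilde{x} | y_j \rangle^{T} \Big) \Big( \sum_j \langle z_i \tilde{x} \tilde{x}^{T} | y_j \rangle \Big)^{-1}    (5)

\Psi = \frac{1}{N} \, \mathrm{diag}\Big( \sum_{i,j} \langle z_i | y_j \rangle \, y_j y_j^{T} - \sum_{i,j} \tilde{\Lambda}_i \langle z_i \tilde{x} | y_j \rangle \, y_j^{T} \Big)    (6)

\pi_i = \frac{1}{N} \sum_j \langle z_i | y_j \rangle    (7)

Here $\tilde{x} = [x^{T}, 1]^{T}$ denotes the factor vector augmented by a constant, so that
the mean $\mu_i$ is estimated together with $\Lambda_i$.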



3.3 Split and Merge EM 

The EM algorithm is a hill-climbing approach, thus local maxima are a serious
problem of EM. When there are many components in one part of the space,
and too few in another, it might not be possible to move a component from
the overpopulated region to the underpopulated region without passing through
positions that have lower likelihood. To overcome this problem we used the split and
merge EM (SMEM) algorithm of Ueda et al. [8].

¹ This is equivalent to minimising the KL-divergence between $Q(H)$ and $P(H, Y|\theta)$.

² $\langle \cdot \rangle$ denotes expectation wrt. $Q$.

In the SMEM algorithm, when the model converges using the EM steps 
described above, two components are merged into one, and one component is split 
into two. Then, only the parameters of these three components are updated until 
convergence (referred to as the partial EM procedure). If this helps to increase the 
likelihood, the new parameters are accepted, otherwise another set of candidates 
are selected for splitting and merging and the partial EM procedure is repeated. 
The ordering of the split and merge candidates is important for the speed of the 
SMEM algorithm. We used the criteria given in [8] to sort the candidates.
These suggest that components that share the responsibilities³ of many
examples are good candidates for merging:

J_{merge}(i, j; \theta) = \frac{R_i(\theta)^{T} R_j(\theta)}{\|R_i(\theta)\| \, \|R_j(\theta)\|}    (8)

where $R_i(\theta)$ is the $N$-dimensional vector consisting of the responsibilities of
the $i$-th model for each of the examples. The split criterion uses the local KL
divergence between the local density $f_k(y)$ around the $k$-th model and the density
of the $k$-th model specified by the current parameters $\theta$:

J_{split}(k; \theta) = \int f_k(y; \theta) \log \frac{f_k(y; \theta)}{P(y | z_k, \theta)} \, dy    (9)

where $z_k$ denotes the indicator variables of the $k$-th model. The local density is
defined as:

f_k(y; \theta) = \frac{\sum_{n=1}^{N} \delta(y - y_n) \, P(z_k | y_n, \theta)}{\sum_{n=1}^{N} P(z_k | y_n, \theta)}    (10)
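
Ranking the merge candidates can be sketched as follows. The sketch assumes that the merge criterion is the inner product of the two responsibility vectors divided by the product of their norms; if an unnormalised inner product is intended, the division can simply be dropped. The function name and the data layout (`resp[n][i]` holding the responsibility of component i for example n) are hypothetical.

```python
import math
from itertools import combinations


def merge_candidates(resp):
    """Return component pairs sorted by decreasing merge score.

    resp: list of rows, one per example; resp[n][i] is the responsibility
    of component i for example n.
    """
    n_components = len(resp[0])
    # One responsibility vector (over all examples) per component
    columns = [[row[i] for row in resp] for i in range(n_components)]

    def score(i, j):
        dot = sum(a * b for a, b in zip(columns[i], columns[j]))
        norm_i = math.sqrt(sum(a * a for a in columns[i]))
        norm_j = math.sqrt(sum(b * b for b in columns[j]))
        if norm_i == 0.0 or norm_j == 0.0:
            return 0.0
        return dot / (norm_i * norm_j)

    pairs = [((i, j), score(i, j)) for i, j in combinations(range(n_components), 2)]
    return sorted(pairs, key=lambda item: item[1], reverse=True)
```

Pairs at the head of the returned list share responsibility for many of the same examples and would be tried first in the partial EM procedure.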



3.4 Model Selection 

An important issue in using MFA is choosing the number of factors ($k$) and the
number of mixtures ($M$) to be used. The likelihood of the fit of the model to
data is used to assess the goodness of fit in the maximum likelihood models.
Factor analysis is a constrained Gaussian model; therefore, the best likelihood
one can achieve using FA is that of the full Gaussian model, and the likelihood
gets closer to this limit as the number of factors is increased. Thus, a $k$ value can
be chosen depending on the closeness of the likelihood of different models to the
full Gaussian model. Cross-validation can be used for determining the number of
mixtures, in which several values of $M$ are fit to the data and the log likelihood
on a validation set is used to select the final value. Alternatively, a Bayesian
analysis in which these parameters are determined automatically may also be
used [1,6].

³ $\langle z_i | y_j \rangle$

4 Results

We first trained FA models with varying numbers of factors to decide how much
dimensionality reduction we could have while keeping a good representation of
the data. The results obtained by modelling the raw data were not very promising,
since only the likelihood of the models with a large number of factors was close to
the limiting likelihood value mentioned in section 3.4. As mentioned in section
3.1, FA is sensitive to the orientation of the input axes. Therefore we tried to
find a representation of the data that would allow good dimensionality reduction.
Using the Fourier transform coefficients of the data helped to obtain a high likelihood
with a lower number of factors (see figure 2 for a comparison). As seen in the figure,
the likelihood for both the raw data and the Fourier transformed data converges to the
full Gaussian model likelihood. The Fourier transformed data gets closer to this value
with fewer factors (32 in this case). Therefore, we used the Fourier transformed
data to train the MFA models that had 32 factors per mixture. It should be
emphasised that we have used all coefficients of the Fourier transform, obtained
by a linear transform of the data, and thus kept all information about the waveforms.
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Fig. 2. Log likelihood for the FA models with only single mixture components trained 
with raw data (o) and with Fourier transformed data (*). The solid line shows the full 
Gaussian model likelihood, which is the limiting value. Note that using MFA, this limit 
can be exceeded, therefore this approach gives only an approximation of the necessary 
number of factors 

After determining the number of components used in each mixture, the next 
step was to find an optimal number of mixtures M in terms of the data likelihood 
and neuron isolation. We used a cross-validation scheme of MFA models trained 
on the whole waveform to determine M. We evaluated the resulting model by 
looking at the value of the entropies, observing the waveforms assigned to dif- 
ferent clusters and comparing with manual clustering. In manual clustering the 
amplitude of one channel versus another is plotted in 2-D graphs in every possible 
combination of the channels, and the human operator manually draws bound- 
aries around regions of high spike density. Therefore, we plotted the same kind 
of amplitude plots for comparison. 

We trained the MFA models using the SMEM algorithm to avoid local maxima.
The cross-validation scheme determined M = 9 to be the optimal number of
clusters. Unfortunately, the average entropy of the responsibilities over all test
samples was not as small as expected. In addition, we observed that the spikes
assigned to some of the clusters seemed to be generated by different neurons,
because their waveforms showed large deviations. As a consequence, those
clusters were composed of multiple disjoint regions in the amplitude plots
and therefore did not match the manual clustering. Thus we concluded that
the clustering was not successful in terms of finding meaningful clusters. This
inability to isolate the spikes with different characteristics is probably due to
the model getting caught in local maxima from which it cannot escape even
with the help of SMEM. To overcome this problem, we trained a model that
had many more mixtures than the assumed number of neurons and merged some
of the clusters after training, using the merge criteria of the SMEM algorithm.

Fig. 3. Recorded samples assigned to different clusters by the model (left) and samples
generated by the model (right). Waveforms produced by (a-f) different neurons, (g-j)
nearly simultaneously firing neurons, (k) noise.

Training a model with 30 mixtures and merging these clusters after training 
resulted in clusters that were similar to those of manual clustering. Specifically 
for the data set used in this study, the MFA model found all clusters that were 
found in manual clustering with similar boundaries, except for one cluster which 
it clustered as a part of noise. On the other hand, the model found some tiny clus- 
ters that had double spike waveforms, which correspond to nearly synchronously 
firing neurons. Also, the model identified another big cluster as a neuron, which 
was assigned to be noise in manual clustering due to its low amplitude. Figure 
3 shows the examples that are assigned to different clusters. The trained model 
was also used to generate simulated data. As can be seen in figure 3, the sim- 
ulated signals are very realistic in the sense that they resemble the recorded 
waveforms while showing some deviations.
Furthermore, the entropy of the responsibilities of both the training and test
examples is close to zero, meaning that the model is almost always sure about
which cluster an example should be assigned to.
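The entropy criterion can be made concrete with a small sketch. It assumes that each sample comes with a vector of responsibilities (posterior cluster probabilities) that sums to one; the function names and the example values are hypothetical.

```python
import math


def responsibility_entropy(resp):
    """Entropy (in nats) of one sample's cluster responsibilities."""
    return -sum(p * math.log(p) for p in resp if p > 0.0)


def mean_entropy(all_resp):
    """Average entropy over a set of samples."""
    return sum(responsibility_entropy(r) for r in all_resp) / len(all_resp)


# Hypothetical responsibilities for three clusters.
confident = [[0.99, 0.005, 0.005], [0.0, 1.0, 0.0]]
uncertain = [[1 / 3, 1 / 3, 1 / 3]]
```

A mean entropy near zero indicates that almost every sample is assigned to a single cluster with high confidence, whereas the maximum value, log of the number of clusters, corresponds to completely ambiguous assignments.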

We have also trained a model on the 4 dimensional feature space used in
manual clustering. The SMEM algorithm helped to escape from the local maxima
in this lower dimensional space; therefore training with more clusters was not
necessary in this case. The clusters found by this low dimensional model were 
similar to the manually found clusters, but the double spike waveforms were not 
detected and the low amplitude cluster found by the higher dimensional model 
was assigned to the noise cluster, as expected. 

5 Conclusion 

We have demonstrated a successful approach to spike sorting of tetrode
recordings using MFA. This method allows the whole spike waveform to be modelled
and can therefore discriminate between neurons with similar amplitude
characteristics across channels and also detect nearly simultaneously occurring
spikes. The entropies of the responsibilities give a measure of the quality of the
clustering. The trained model can also be used to synthesise realistic labelled
data sets that can be used to compare different spike sorting methods. A drawback
of this method is that it is not fully unsupervised: the spike waveforms assigned
to the resulting clusters should be assessed by an expert.

Abstract. The human brain classifies natural scenes and recognizes objects
in complex visual patterns with high precision in a minimum
amount of processing time. Only a few action potentials (spikes) per neuron
and per processing stage are sufficient to achieve this astonishingly high
performance, despite the random nature of the incoming spike trains. In
this contribution, we present a novel algorithm which updates the inter- 
nal representation of patterns in a generative model with each incoming 
spike. We first demonstrate that our algorithm is capable of learning a 
suitable representation of pattern ensembles from stochastically gener- 
ated spike trains. This representation is then used for classifying test 
patterns, requiring less than one spike per input node to achieve a per- 
formance comparable to standard algorithms in pattern recognition. 



1 Introduction 

Recently, experimental work has shown that humans can categorize natural 
scenes within 150 ms after onset of the presentation [8]. At least ten different, 
hierarchically ordered processing stages (brain areas) are involved in this task. 
With typical firing frequencies of about 50 Hz, this leaves only time for less than 
one spike per neuron for a successful processing of the stimuli. Making things 
even worse, spikes in the cortex are elicited randomly, their statistics resembling 
a Poissonian process. 
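The spike budget stated above follows from simple arithmetic. The sketch below only restates the numbers given in the text (150 ms, ten stages, 50 Hz); the variable names are chosen for illustration, and an equal share of time per stage is an assumption.

```python
total_time_s = 0.150     # categorization time after stimulus onset [8]
num_stages = 10          # "at least ten" processing stages
firing_rate_hz = 50.0    # typical firing frequency

# Assumption: the available time is shared equally among the stages.
time_per_stage_s = total_time_s / num_stages
expected_spikes_per_neuron = firing_rate_hz * time_per_stage_s
```

With these values each stage has 15 ms, in which a neuron firing at 50 Hz emits 0.75 spikes on average, i.e. less than one.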

These observations pose a challenge for pattern recognition algorithms which 
are required to achieve a high performance under restrictive boundary condi- 
tions. In our case we have three main constraints: the recognition process should 
rely on single spikes, it should be robust against a high degree of noise, and it 
should require only about one spike per input node until the scene or pattern 
is recognized. To explain the brain’s performance, one has to propose a suitable 
neuronal algorithm which fulfills all three of these requirements, while taking 
advantage of the parallel processing properties of neuronal populations. 

Previous work has shown that analog values can be transmitted faithfully 
with single spikes in a population code [2,1]. However, this approach requires the 
allocation of several channels for only one analog value. In a different paradigm 
devised by Thorpe et al. [9], this excessive usage of resources is overcome by 
employing a rank order code for spike emission. Images to be represented are 
decomposed into their principal components, and spikes are transmitted in an
order determined by the strength of the component’s coefficients. However, this
rank order code can be sensitive to noise, and the decomposition process requires 
extensive pre-processing of the images which can take a lot of extra time in the 
brain. 

In this article, we present a different approach by modifying a generative 
model [4] to work with single, random spikes. The model has originally been 
used in the context of non-negative matrix factorization. A generative model 
can be described as a network consisting of input nodes connected to hidden 
nodes (Fig. 1, left). The connections are interpreted as conditional probabilities
to observe an activation in one of the input nodes, given an activation in one
of the hidden nodes. The dynamics in a generative model updates the internal
representation in the hidden nodes (reconstruction/estimation) and/or the
conditional probabilities (learning). The goal of this update is to accurately
predict the input from the hidden representation. In terms of neuronal information
processing, a successful prediction can be interpreted as a correctly perceived 
stimulus. While in a probabilistic framework, connections are interpreted in a 
top-down manner, the flow of information in a generative model (i.e., the spikes) 
proceeds bottom-up like in a feed-forward neural network. 

In the next sections, we will first present algorithms for batch learning, on-line 
learning and reconstruction. Then the algorithms will be applied to a standard 
benchmark of handwritten digit recognition (USPS). Finally, the results will be 
shown and discussed in the contexts of machine learning and neuronal networks. 



2 Generative Model 

The generative model (Fig. 1, left) consists of $s = 1, \ldots, M$ input nodes,
$i = 1, \ldots, H$ hidden nodes, and conditional probabilities $p(s|i)$. The $K$ input
patterns $v_k(u) \in [-\infty, +\infty]$, $k = 1, \ldots, K$ are converted to firing rates
$r_k(s) \geq 0$, from which spike trains are drawn for all input nodes $s$. Accordingly,
the probabilities for observing a spike in node $s$ are given by
$p_k(s) = r_k(s) / \sum_{m=1}^{M} r_k(m)$. In this way the sequence of active input nodes
is generated from a Bernoulli process given by the probabilities $p_k(s)$. Real time
is then re-parameterized by the number of spike events, which we here denote by $t$
for simplicity. Note, however, that this spike-by-spike clocking implies that real
time is proportional to the average of $t / \sum_s r(s)$, i.e. it is scaled by the total
rate of the population. Each time $t$ a spike is observed at node $s^t$, the hidden
representation $h(i)$ and/or the $p(s|i)$ are updated. One presentation of a pattern
$k$ extends over $T$ input spikes. The input is then fully specified by the sequence
vector of temporally ordered indices $s^T = \{s^1, \ldots, s^t, \ldots, s^T\}$ of the nodes
at which those spikes were observed. During update, the goal is to maximize the
likelihood $P(s^T | h(i), p(s|i))$ of observing $s^T$ over the model parameter space
$h(i)$ and $p(s|i)$. With $\hat{p}(s) = \frac{1}{T} \sum_{t=1}^{T} \delta_{s,s^t}$
counting the relative number of spikes at node $s$ in the observation sequence
over the time window $T$, the likelihood is given by

$$P\left(s^T \mid \{h(i), p(s|i)\}_{i=(1,\ldots,H),\, s=(1,\ldots,M)}\right) = D \prod_{s=1}^{M} p(s)^{T \hat{p}(s)} \qquad (1)$$

with $p(s) = \sum_{j=1}^{H} p(s|j) h(j)$ and the combinatorial factor $D$. For practical
reasons, we minimize the negative logarithm of the likelihood,

$$-\frac{1}{T}\log P = -\frac{1}{T}\log D - \sum_{s=1}^{M} \hat{p}(s) \log p(s) . \qquad (2)$$

This minimization problem is subject to the non-negativity constraints
$p(s|i) \geq 0$ and $h(i) \geq 0$ and the normalization constraints
$\sum_{s=1}^{M} p(s|i) = 1$ and $\sum_{i=1}^{H} h(i) = 1$, respectively.

Fig. 1. Spike-by-Spike network comprised of $M$ input and $H$ hidden nodes connected
by the conditional probabilities or weights $p(s|i)$ (left). During training of the network,
the pattern and its correct classification are presented together to the first $M_p$, and
the remaining $M_c$ input nodes, respectively. Modified Spike-by-Spike network for
reconstruction and classification (right). The $M_c$ weight vectors used for training have
been transposed and normalized to form the new weight vectors $p(i|c)$ which are used
to classify the test patterns.
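The spike generation and the objective of Eq. (2) can be illustrated with a short sketch. It is not the authors' implementation: the function names, the 0-based node indices, the fixed random seed and all numerical values are assumptions made for illustration, and the constant term containing the combinatorial factor D is left out.

```python
import math
import random


def spike_probabilities(rates):
    """p_k(s) = r_k(s) / sum_m r_k(m) for non-negative rates r_k(s)."""
    total = sum(rates)
    return [r / total for r in rates]


def draw_spike_train(rates, num_spikes, rng):
    """Draw the indices s^1, ..., s^T of the nodes that emit the T spikes.
    Node indices are 0-based here, unlike the 1-based indices in the text."""
    probs = spike_probabilities(rates)
    return rng.choices(range(len(rates)), weights=probs, k=num_spikes)


def model_probabilities(weights, h):
    """p(s) = sum_j p(s|j) h(j), with weights[j][s] holding p(s|j)."""
    num_inputs = len(weights[0])
    return [sum(weights[j][s] * h[j] for j in range(len(h)))
            for s in range(num_inputs)]


def neg_log_likelihood(spikes, weights, h):
    """-sum_s p_hat(s) log p(s): Eq. (2) without the constant -log(D)/T.
    Assumes p(s) > 0 at every node that emitted a spike."""
    p_model = model_probabilities(weights, h)
    return -sum(math.log(p_model[s]) for s in spikes) / len(spikes)


rng = random.Random(0)                  # fixed seed, for reproducibility
weights = [[0.5, 0.25, 0.25, 0.0],      # p(s|i=0), hypothetical values
           [0.0, 0.0, 0.5, 0.5]]        # p(s|i=1), hypothetical values
spikes = draw_spike_train([2.0, 1.0, 1.0, 0.0], num_spikes=1000, rng=rng)
nll_matched = neg_log_likelihood(spikes, weights, [1.0, 0.0])
nll_mixed = neg_log_likelihood(spikes, weights, [0.5, 0.5])
```

In this toy setting the spikes are drawn from the distribution of the first hidden node, so the hidden state that selects this node yields a lower objective than a mixed hidden state.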

To avoid situations in which an input pattern of negative values leads to
problems in the spike generation process in the input nodes, the original patterns
$v_k(u)$ are pre-processed yielding the input rates $r_k(s)$: in a first step, patterns
$v_k(u)$ (with $L = M/2$ components) are corrected by subtracting the individual
mean for each pattern, $\langle v_k(u) \rangle_L = \frac{2}{M} \sum_{u=1}^{L} v_k(u)$,

$$v_k(u) := v_k(u) - \langle v_k(u) \rangle_L . \qquad (3)$$

Each of the $L$ components is duplicated and distributed over the odd and even
input node pairs, yielding $M$ non-negative rate components $r_k(s)$ according to
the expressions

$$r_k(2u-1) = \begin{cases} +v_k(u) & \text{for } v_k(u) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

$$r_k(2u) = \begin{cases} 0 & \text{for } v_k(u) > 0 \\ -v_k(u) & \text{otherwise} \end{cases} \qquad (5)$$
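Eqs. (3)-(5) amount to a mean subtraction followed by a split into a positive and a negative channel. A minimal sketch, with a hypothetical function name and a list-based representation chosen for illustration:

```python
def preprocess(pattern):
    """Map L real values to M = 2L non-negative rates (Eqs. 3-5)."""
    mean = sum(pattern) / len(pattern)
    rates = []
    for value in pattern:
        v = value - mean                          # Eq. (3): remove the pattern mean
        rates.append(v if v > 0 else 0.0)         # Eq. (4): node 2u-1 carries the positive part
        rates.append(0.0 if v > 0 else abs(v))    # Eq. (5): node 2u carries -v (v <= 0 here)
    return rates


# Hypothetical pattern with mean 1.0.
rates = preprocess([1.0, -1.0, 3.0, 1.0])
```

For each component at most one of the two nodes of a pair has a non-zero rate, so the sign of the mean-corrected value is encoded by which node is active.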



This pre-processing is motivated by the nature of our brain: the splitting into
negative and positive values closely resembles the analysis of visual stimuli by
on- and off-cells in the lateral geniculate nucleus (LGN).

3 Algorithms

The likelihood $P$ can be maximized, i.e. the negative log-likelihood of Eq. (2)
minimized, by updating $h(i)$ (reconstruction), or by updating $p(s|i)$ and $h(i)$
together (learning). First, we will consider the $p(s|i)$ as fixed. Within this
section, superscripts denote time indices.

Reconstruction. For the reconstruction, we adopt an existing algorithm
from Lee and Seung [4] starting from the update equation

$$h^{t+1}(i) = h^t(i) \sum_{s=1}^{M} p(s|i)\, \frac{\hat{p}(s)}{p^t(s)} , \qquad (6)$$

where $\hat{p}(s)$ denotes the observed pattern (the relative spike counts) and
$p^t(s) = \sum_j p(s|j) h^t(j)$ the model prediction at time $t$. In our case, we
observe only one spike per time step in input node $s^t$, and not the whole pattern
$\hat{p}(s)$. Thus, we require the algorithm to predict the next spike by first
substituting $\hat{p}(s)$ by $\delta_{s,s^t}$ in Eq. (6). Because the next spike is almost
never representative of the whole input pattern $\hat{p}(s)$, we apply an additional
low-pass filter with time constant $\epsilon$, leading to the reconstruction algorithm

$$h^{t+1}(i) = h^t(i) \left[ (1-\epsilon) + \epsilon\, \frac{p(s^t|i)}{p^t(s^t)} \right] . \qquad (7)$$

Batch Learning. In general, our brain acquires its knowledge and experience
over long time scales ranging from hours to years, while the fast spiking
dynamics takes place on a time scale of milliseconds. Therefore, it is reasonable
to separate the update time scales of $h(i)$ and $p(s|i)$. While $h(i)$ is changed
every time a spike occurs, $p(s|i)$ will be changed only after the presentation of
$K$ patterns with $T$ spikes each. For such a batch learning rule, we can apply the
corresponding formula of Lee and Seung [4]

$$\tilde{p}^z(s|i) = p^z(s|i) \sum_{k=1}^{K} \frac{\langle h_k(i) \rangle_\Delta \, p_k(s)}{\sum_{j=1}^{H} p^z(s|j) \langle h_k(j) \rangle_\Delta} \qquad (8)$$

$$p^{z+1}(s|i) = \tilde{p}^z(s|i) \Big/ \sum_{u=1}^{M} \tilde{p}^z(u|i) , \qquad (9)$$

with $\langle h_k(i) \rangle_\Delta = \frac{1}{\Delta} \sum_t h_k^t(i)$ denoting the average of
$h_k^t(i)$ over $\Delta$ time steps. Each time a new pattern is presented,
the hidden nodes are initialized with $h_k^0(i) = 1/H$.

Online Learning. Eq. (8) has a slight disadvantage, because it requires
remembering the final averaged internal states $\langle h_k(i) \rangle_\Delta$ of $K$ pattern
presentations for one update step. While a computer has no problems in fulfilling
this requirement, the brain could lack the possibility to temporarily store all
$\langle h_k(i) \rangle_\Delta$'s. This limitation can be overcome by deriving an on-line
learning rule from scratch, which uses only one pattern at once (Eq. (10)).
In this rule, $\gamma$ is an update constant which is in general much smaller than
$\epsilon$. The derivation of Eq. (10) is based on optimization under
Karush-Kuhn-Tucker conditions with positivity constraints, following the
procedure described on pages 1400-1402 in [3] and substituting the observed
pattern $\hat{p}(s)$ by $\delta_{s,s^t}$ like in Eq. (6).




Fig. 2. Conditional probabilities or weight vectors $p^*(s|i)$ for the on-line learning (left),
and for the batch learning algorithm for $H = 100$ hidden nodes (right). Each vector
is displayed in a square raster of 16 x 16 pixels. The individual vectors $i$ for odd
and even input nodes are combined and normalized to a grey value $g_u$ between 1 and
$-1$ by means of the transformation $g_u = p^*(2u-1|i) - p^*(2u|i)$, $g_u = g_u / \max\{|g_u|\}$.
Parameters for on-line learning were $w = 0.9$, $\epsilon = 0.9375$, $\gamma = 0.0005$, and $T = 2000$.
Parameters for batch learning were $w = 0.5$, $\epsilon = 0.1$, $\Delta = 500$ and $T = 5620$. During
on-line learning, all training patterns were presented only once. In contrast, training
patterns were presented repeatedly during 20 learning steps in the batch procedure.



4 Simulations 

Learning and classification. For the learning, the $M$ input nodes are divided
into $M_p$ pattern nodes and $M_c$ classification nodes. $K_{tr}$ training patterns
$v_k^{tr}$ together with their correct classification (coded in the firing
probabilities in the $M_c$ input nodes; see Fig. 1, left) are presented successively
to the network, while $p(s|i)$ and $h(i)$ are updated according to Eqs. (7), (8),
(9), and (10).

For the classification run, the network uses only the first $M_p$ input nodes
for pattern input. The first part of the weight vectors $p(s|i)$ is re-normalized,
yielding the new weights $p^*(s|i) = p(s|i) / \sum_{u=1}^{M_p} p(u|i)$. The remaining $M_c$
weight vectors are transposed and normalized, yielding the classification weights
$p(i|c) := p(c + M_p|i) / \sum_{l=1}^{M_c} p(l + M_p|i)$ for $c = 1, \ldots, M_c$. From $h(i)$
and $p(i|c)$, the probabilities $q_k(c) = \sum_{i=1}^{H} p(i|c) h_k(i)$ for each of the
$K_{ts}$ test patterns $v_k^{ts}$ to belong to the class $c$ are computed, leading to the
predicted classification $c_k = \mathrm{argmax}_c\, q_k(c)$. The mean classification error $e$
over all patterns is then computed as the fraction of misclassified patterns,
$e = \frac{1}{K_{ts}} \sum_{k=1}^{K_{ts}} \left(1 - \delta_{c_k^{ts}, c_k}\right)$.

Fig. 3. Mean classification error $e$ for the USPS database shown for the batch learning
rule (dashed line) and for the on-line learning rule (solid line), as a function of the
mean number of spikes per input node. Chance level is at 90 percent. For comparison,
the dotted line shows the classification performance obtained with a standard nearest
neighbor algorithm, for which the $\hat{p}(s)$ for different spike vector lengths $T$ were used.
Parameters for on-line and batch learning were chosen as in Fig. 2.
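The classification step described above, i.e. scoring each class from the hidden representation, taking the argmax and counting errors, can be sketched as follows. The function names, the list layout and the example numbers are hypothetical; class indices are 0-based.

```python
def class_scores(h, p_i_given_c):
    """q(c) = sum_i p(i|c) h(i), with p_i_given_c[c][i] holding p(i|c)."""
    return [sum(p * h_i for p, h_i in zip(row, h)) for row in p_i_given_c]


def predict(h, p_i_given_c):
    """Predicted class: the index c that maximizes q(c)."""
    scores = class_scores(h, p_i_given_c)
    return max(range(len(scores)), key=scores.__getitem__)


def classification_error(predicted, true_labels):
    """Fraction of patterns whose predicted class differs from the true one."""
    wrong = sum(1 for p, t in zip(predicted, true_labels) if p != t)
    return wrong / len(true_labels)


# Two classes, three hidden nodes, hypothetical classification weights.
p_i_given_c = [[0.9, 0.1, 0.0],
               [0.0, 0.2, 0.8]]
hidden_states = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.3, 0.2]]
predictions = [predict(h, p_i_given_c) for h in hidden_states]
error = classification_error(predictions, [0, 1, 1])
```

In the paper's setting the hidden states would be those obtained from the reconstruction rule after a given number of input spikes, which is what allows the error to be plotted against the number of spikes per input node.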



Data Base. We subjected our algorithms to the problem of recognizing
handwritten digits. The data base from the United States Postal Service (USPS)
consists of $K_{tr} = 7291$ training patterns $v_k^{tr}$ and $K_{ts} = 2007$ test patterns
$v_k^{ts}$. Each pattern comprises $M_p/2 = 16 \times 16$ grey scale values (pixels) ranging
from $-1$ (white) to 1 (black). During the test run, the patterns $v_k^{ts}$ were applied
according to Eq. (5), leading to input rates $r_k^{ts}$. During the training run, however,
the patterns $v_k^{tr}$ are first normalized and duplicated according to Eq. (5). Together
with the correct assignment $c_k^{tr} \in \{0, \ldots, 9\}$ to one digit class, the input rates
to all $M = M_p + 10$ nodes $s$ are defined as

$$r_k(s) = w\, r_k^{tr}(s) \Big/ \sum_{u=1}^{M_p} r_k^{tr}(u) \quad \text{for } s \in [1, M_p] \qquad (11)$$

$$\text{and} \quad r_k(s) = (1 - w)\, \delta_{s+M_p,\, c_k^{tr}} \quad \text{otherwise.} \qquad (12)$$

The weighting parameter $w$ controls the balance between the pattern and
classification inputs.



5 Results 

We applied the learning and reconstruction algorithms to the USPS database,
varying the parameters $\epsilon$, $\gamma$, and $w$ to achieve the best possible performance.
In Fig. 2, the comparison between the weights shows that the on-line learning
rule leads to the formation of digit templates, whereas the batch learning rule in
addition extracts typical features common to more than one digit. Consequently,
batch learning is slower during the first 0.3 spikes per input node, but achieves
a lower classification error in the long run (Fig. 3). The minimum error of 8.9%
shows that the algorithm performs suitably well, but does not entirely reach
the performance of other classifiers, around 4-5% (for an overview, see [7]).
We also subjected our algorithms to different types of noise in order to test
the robustness of learning and classification: first, a varying number of rows or
columns in the digit image was occluded by setting the corresponding pixels
to a value of 0. Up to about 6 rows or columns, the classification
error nevertheless remains below 20 percent (Fig. 4, left). With increasing
coverage, vertical occlusion has a stronger impact on recognition because the digit
patterns normally occupy an area whose height is larger than its width. Second,
we superimposed each digit pattern $v_k(u)$ with an image consisting entirely of
random pixel values $v_k^{rnd}(u)$ uniformly distributed between $-1$ and 1. The noise
level was varied by means of a parameter $\eta \in [0, 1]$ by combining the original
pattern and noise as $(1 - \eta) v_k(u) + \eta v_k^{rnd}(u)$. Fig. 4 (right) shows that it
requires a fair amount of noise of $\eta \approx 0.35$ to increase the error rate to values
above 20 percent.
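The two perturbations used in the robustness test can be sketched as below. The function names are hypothetical, and the choice of which rows are occluded (here: counted from the top) is an assumption, since the text does not state it.

```python
import random


def add_uniform_noise(pattern, eta, rng):
    """Return (1 - eta) * v(u) + eta * v_rnd(u), v_rnd(u) uniform in [-1, 1]."""
    return [(1.0 - eta) * v + eta * rng.uniform(-1.0, 1.0) for v in pattern]


def occlude_rows(image, num_rows):
    """Set the pixels of the first num_rows rows of a 2-D image to 0.
    Occluding from the top is an assumption made for illustration."""
    return [[0.0] * len(row) if r < num_rows else list(row)
            for r, row in enumerate(image)]


rng = random.Random(1)                      # fixed seed, for reproducibility
pattern = [-1.0, 0.0, 0.5, 1.0]             # hypothetical pixel values in [-1, 1]
noisy = add_uniform_noise(pattern, eta=0.35, rng=rng)
occluded = occlude_rows([[1.0, 1.0], [0.5, -0.5], [1.0, 0.0]], num_rows=1)
```

Because the noisy pattern is a convex combination of two images with values in [-1, 1], it stays within the original grey value range for every noise level.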





Fig. 4. Minimum classification error as a function of the number of horizontally (light
bars) or vertically occluded (dark bars) lines in the digit images (left), and as a
function of the amount of noise $\eta$ on the digit pixels (right). Parameters for the
learning were chosen as in Fig. 2.



6 Summary and Discussion 

In this contribution, we have developed a framework which can explain the 
tremendous speed of our brain in processing and categorizing natural stimuli, 
provided that the neural hardware can realize such a generative model. Our 
’Spike-by-Spike’-network is able to classify patterns with less than one spike 
from each input node, despite the randomness in the information transmission. 
In addition, we presented two algorithms for on-line and batch learning being 
capable of finding suitable representations for arbitrary pattern ensembles. In 
general, the update of the hidden nodes occurs in real time more frequently when 
the number of input nodes increases. This implies that the required number of
mathematical operations per unit time increases, too. However, this potential
problem for a machine can be solved by parallel processing in the brain. 

Further development of our algorithms will focus on classifying mixtures 
of patterns and on non-stationary pattern presentations. Preliminary studies 
indicate that the Spike-by-Spike algorithm can, under special circumstances, 
extract the different sources which were superimposed on one input pattern 
(blind source separation). As an example for non-stationary stimuli, it is possible 
to learn and to estimate the intended arm movement of a neural prosthesis from 
the spike data recorded in the motor system of primates [5]. 

A similar approach for classifying temporal patterns from one input chan- 
nel has been investigated by Wiener and Richmond [10]. They use an itera- 
tive Bayesian scheme to successfully re-estimate the presence of a specific time- 
varying stimulus with each incoming spike. 

While the brain is highly modular, our Spike-by-Spike network is only a 
two-layered system with no hierarchy. Therefore, it remains to show that these 
networks can be used like logical modules in a computer, grouping arbitrary func- 
tional units together in order to realize more complex computations. First results 
with hand-coded weights are very promising (Ernst, Rotermund, and Pawelzik;
submitted to Neural Computation), but still a suitable learning algorithm for
layered networks has to be found.

Abstract. When an object moves, it covers and uncovers texture in the
background. This pattern of change is sufficient to define the object’s shape,
velocity, relative depth, and degree of transparency, a process called Spatiotemporal
Boundary Formation (SBF). We recently proposed a mathematical framework
for SBF, where texture transformations are used to recover local edge segments,
estimate the figure’s velocity and then reconstruct its shape. The model predicts
that SBF should be sensitive to spatiotemporal noise, since the spurious
transformations will lead to the recovery of incorrect edge orientations. Here we tested
this prediction by adding a patch of dynamic noise (either directly over the
figure or a fixed distance away from it). Shape recognition performance in humans
decreased to chance levels when noise was placed over the figure but was not
affected by noise far away. These results confirm the model’s prediction and also
imply that SBF is a local process.



1 Introduction 

The Peacock Flounder can change its coloration such that there are no easily detectable 
differences in luminance, color, or texture patterns between itself and its surroundings, 
rendering it almost invisible. When the flounder moves, however, it is immediately and 
easily visible. This and similar observations suggest that patterns of change over time 
may be sufficient to visually perceive an object. 

This observation was formalized by Gibson [1], who claimed that the pattern of 
texture appearances and disappearances at the edges of a moving object (i.e., dynamic 
occlusion) should be sufficient to define that object’s shape. Several researchers have 
shown that this pattern is indeed sufficient for humans to properly perceive not only an 
object’s shape, but also its velocity, relative depth, and degree of transparency [2]. The
process of using this dynamic pattern to perceive the shape of an object is referred to 
as Spatiotemporal Boundary Formation (SBF). The types of transformation that lead to 
SBF extend well beyond simple texture appearances and disappearances, however, and 
include changes in the color, orientation, shape, and location of texture elements [3-5].
The use of dynamic information to define a surface avoids many of the problems in- 
herent in static approaches to object perception, and offers a robust way of determining 
most of the properties of an object from very sparse information while making few as- 
sumptions. Machine vision implementations of SBF could be a welcome addition to 
current object perception techniques.

In a first step towards such an implementation, Shipley and Kellman [5] provided a
mathematical proof showing that the orientation of a small section of a moving object (a 
“local edge segment”, or LES) could be recovered from as few as three element trans- 
formations. Briefly, each pair of element transformations is encoded as a local motion 
vector. The vector subtraction of two local motion vectors yields the orientation of the 
edge. This model predicts that LES recovery, and thus all of SBF, will fail if the ele- 
ments are spatially collinear, a phenomenon which Shipley and Kellman subsequently 
psychophysically demonstrated [5]. Likewise, the model suggests that the recovery of
the orientation of an LES is sensitive to the spatiotemporal precision of the individual 
transformations. That is, the more error there is in knowing where and when the ele- 
ments changed, the more error there will be in the recovered orientation. This implies 
that SBF should be very sensitive to the presence of dynamic noise (element transfor- 
mations not caused by dynamic occlusion). Finally, Shipley and Kellman’s proof also 
showed that one should be able to substitute the object’s global velocity vector for one 
of the local motion vectors (which we will refer to as velocity vector substitution), in 
which case only two element transformations are needed to recover an LES. 
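The recovery of an edge orientation from three element transformations can be sketched as follows. This is an illustration of the vector subtraction described above, not the authors' code: it assumes a straight edge translating at constant velocity, events encoded as hypothetical (x, y, t) tuples with distinct times, and function names chosen for this example.

```python
import math


def motion_vector(event_a, event_b):
    """Local motion vector for a pair of element transformations (x, y, t)."""
    dt = event_b[2] - event_a[2]
    return ((event_b[0] - event_a[0]) / dt, (event_b[1] - event_a[1]) / dt)


def edge_direction(e1, e2, e3):
    """Unit vector along the edge, from the difference of two motion vectors."""
    v12 = motion_vector(e1, e2)
    v13 = motion_vector(e1, e3)
    dx, dy = v12[0] - v13[0], v12[1] - v13[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        # Happens, for example, when the three elements are spatially collinear.
        raise ValueError("degenerate configuration: motion vectors coincide")
    return (dx / norm, dy / norm)


# A vertical edge whose x-position equals t sweeps over three elements.
direction = edge_direction((1.0, 0.0, 1.0), (2.0, 3.0, 2.0), (4.0, -1.0, 4.0))
```

Both local motion vectors share the same component perpendicular to the edge, so their difference is parallel to the edge. When the elements are collinear the two vectors coincide and the difference vanishes, which corresponds to the failure case mentioned above.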

Cunningham, Graf and Bülthoff [6-8] revised this proof and embedded it in a
complete mathematical framework for SBF. With this framework, the complete global form
and velocity of a surface moving at a locally constant velocity can be recovered. The 
framework consists of three stages. The first stage is similar to Shipley and Kellman’s: 
The orientations of the figure’s edges (the LES’s) are recovered by integrating element 
transformations from a local neighborhood in space-time. The elements’ locations and 
the times when they were transformed are encoded relative to each other (i.e., the
framework is agnostic on the actual representational format of the changes; it does not
require them to be encoded as motion vectors). In the second phase, the orientations of the
LES’s are used in conjunction with the relative spatiotemporal locations of the element 
transformations to recover the global velocity of the figure. This process, which requires 
at least two LES’s of differing orientations, is mathematically very similar to that used 
to recover an LES’s orientation. If all of the orientations are the same, one can only 
recover that portion of the global velocity that is perpendicular to the LES’s (this is the 
well-known motion aperture effect). Finally, the global motion of the figure, the ori- 
entations of the LES’s, and the locations of the element transformations, are used to 
determine the minimum length of each LES necessary to cause those transformations, 
as well as the relative locations of the LES’s. To complete the process, the LES’s may 
be joined to form a closed contour using a process similar to illusory contour perception 
(for example, see [9-11]). 

Cunningham et al. [12] explicitly tested whether humans can take advantage of the 
velocity vector substitution process predicted by the model. To do this, they added a set 
of additional texture elements that had the exact same velocity as the moving shapes. 
The same set of additional elements was used for all shapes, so they did not provide 
additional static shape information. Cunningham et al. found that the extra motion in- 
formation did indeed improve shape identification performance, but only if the new 
elements were seen as being on the surface of the figure. That is, velocity vector sub- 
stitution is possible, but only when the extra element motion is seen as belonging to the 
figure.

So far, all of the model’s predictions reflect human performance: Collinearity of
the transformations prevents SBF [5], identical orientations of the LES’s hinder proper
velocity recovery [13], and velocity vector substitution can improve the quality of the
recovered shape [12]. What about the prediction that SBF should be strongly affected
by dynamic noise (i.e., the presence of transformations that are not caused by dynamic 
occlusion)? In an inconclusive test of this prediction, Shipley and Kellman [5] per- 
formed an experiment that included a condition with a large second element field that 
jumped around the screen randomly. The presence of this second field impaired SBF 
(strongly at low element field densities, less so at higher densities). This field may be 
described as a set of individual elements that flicker on and off, creating appearances 
and disappearances similar to those produced by dynamic occlusion (in which case the 
impairment in shape perception demonstrates SBF’s sensitivity to dynamic noise). It 
may also, however, be described as a single element field with a rapidly changing (i.e., 
Brownian) velocity, and thus the impairment could be accounted for by substituting the 
Brownian global velocity vector into the LES recovery stage. This latter explanation 
also accounts for the results in their other experimental conditions, and is consistent 
with Cunningham et al.’s [12] work on velocity vector substitution, and thus is the most 
parsimonious explanation. 

Both Shipley and Kellman’s proof, and Cunningham et al.’s mathematical frame- 
work predict, however, that the flickering elements should impair SBF. Here, we explic- 
itly test this prediction by adding a flickering surface texture (i.e., a patch of dynamic 
noise) to the moving object. Since we can detect the global velocity of dynamic noise 
fields, and since additional, consistent global velocity information improves SBF, the 
motion of a flickering surface texture should provide valid global motion information, 
which should improve SBF. On the other hand, the presence of spurious appearances 
and disappearances (i.e., flickering elements) near the edges of the object should impair 
SBF. As a control condition, we examined the effect of a flickering texture that is far 
away from the moving figure. Since the global motion of a distant texture field does not 
affect SBF [12], the flickering elements should only affect SBF in the control
condition if the spatial integration window for SBF is rather large (i.e., if SBF is more of a
“global” than a “local” process).



2 Methods 



Ten people were paid 8 Euro per hour to participate in the experiment, which lasted 
about 30 min. Displays were presented on a 17" CRT monitor. Participants were posi- 
tioned approximately 50 cm from the screen. 

The displays consisted of a 14.6 x 14.6 cm field (visual angle of about 16.3°) of 
single -pixel, white dots distributed randomly on a black background. One of ten radi- 
ally monotonic shapes, shown in Figure 1, moved over the random dot field along a 
circular trajectory of radius 5.72°. The shape completed a single circuit of the trajec- 
tory in six seconds. The shapes were identical to those used by Shipley and colleagues 
in their experiments. This set of shapes has been shown to provide a reliable means of 
determining which variables affect SBF [2].
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Fig. 1. The ten shapes used in the experiment. 

This experiment only used the “unidirectional” type of displays: Whenever the lead- 
ing edge of the form moved over a white dot, the dot was transformed to black and "dis- 
appeared" from the display. When the trailing edge of the form reached the dot, the dot 
was changed back to white, thus "reappearing" in the display. A second type of display 
that is typically used, called a “bidirectional” display, is identical to a unidirectional 
display, with the sole exception that elements may either appear or disappear along any 
edge (i.e., half of the elements are only visible outside of the figure, as in the unidi- 
rectional displays, and half are only visible inside the figure). Well-defined shapes are 
seen in the bidirectional displays, despite the absence of any form of shape information 
except dynamic occlusion [5]. Bidirectional displays were not used in the current
experiment for theoretical reasons (i.e., there are some concerns about surface formation, the
direction of surface binding, and the role of velocity vector substitution in bidirectional 
displays, see Cunningham et al. [12] for more on this topic). The number of dots was 
systematically varied: The background had 100, 200, or 400 elements. 




Fig. 2. Sketch of the three experimental conditions: a) "noise-free": the occluder moves through 
the random dot field along a circular path; b) "noise near": the noise pattern, represented by dark 
black dots inside a square, is superimposed on the moving figure; c) "noise far": the moving figure 
and noise pattern are separated by 180°. 

Three noise conditions were used: a condition without noise, a condition where 
dynamic (i.e., spatiotemporal) noise was placed near the object, and a condition where 
dynamic noise was placed far from the object (see Figure 2). The dynamic noise pattern 
was a set of white dots which appeared at random locations inside a virtual box of 
fixed size. The size of the box was chosen such that it circumscribed the largest object. 
In this way, the object could not be identified using the size or the shape of the noise 
pattern. The number of dots placed inside the box was 15% of the number of dots in 
the background. The noise dots had a limited lifetime - the location of the noise dots 
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was changed every four frames. Note that the noise dots themselves never moved. The 
new location of each dot was chosen to be within the virtual box. In the “noise-near” 
condition, the box containing the noise dots was superimposed on the moving figure. In
the “noise-far” condition, the noise pattern was placed 180° away from the figure along
the circular trajectory. 
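
As a rough illustration of the noise pattern described above, the following Python sketch generates the noise-dot positions frame by frame. It assumes a hypothetical callback `box_center_at(frame)` that returns the centre of the virtual box (on the figure for noise-near, 180° away for noise-far); the function name, the coordinate units, and the seeding are illustrative assumptions and are not taken from the original experiment software.

```python
import random


def noise_dot_frames(n_background, box_center_at, box_size, n_frames,
                     fraction=0.15, lifetime=4, seed=0):
    """Return, for every frame, the list of (x, y) noise-dot positions."""
    rng = random.Random(seed)
    n_noise = round(fraction * n_background)  # 15% of the background dots
    half = box_size / 2.0
    frames, dots = [], []
    for frame in range(n_frames):
        if frame % lifetime == 0:
            # Dots never move: every `lifetime` frames they are replaced
            # by new dots at random locations inside the virtual box.
            cx, cy = box_center_at(frame)
            dots = [(rng.uniform(cx - half, cx + half),
                     rng.uniform(cy - half, cy + half))
                    for _ in range(n_noise)]
        frames.append(list(dots))
    return frames
```

With 200 background dots this yields 30 noise dots that stay in place for four frames and are then redrawn.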

The experiment consisted of a single block of 90 randomized trials (10 figures x 
3 densities x 3 noise conditions). Participants were asked to identify the shape mov- 
ing in each display using a ten-alternative forced-choice task. Static images of the ten 
possible choices were shown at all times on the left side of the monitor. Each shape 
moved around its circular trajectory until the participant responded or until the shape 
completed two cycles through the trajectory, whichever came first. If the participant did 
not respond before the end of the second cycle, the display was cleared (except for the 
10 reference shapes on the side of the screen), and remained blank until the participant 
responded. 
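
The trial structure above (10 figures x 3 densities x 3 noise conditions, presented in random order) can be sketched as follows. The shape indices and condition labels are placeholders chosen for illustration only.

```python
import itertools
import random

SHAPES = range(10)                 # placeholder indices for the ten shapes
DENSITIES = (100, 200, 400)        # number of background elements
NOISE_CONDITIONS = ("noise-free", "noise-near", "noise-far")


def make_trials(seed=None):
    """Return the 90 (shape, density, noise) trials in random order."""
    trials = list(itertools.product(SHAPES, DENSITIES, NOISE_CONDITIONS))
    random.Random(seed).shuffle(trials)
    return trials
```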



3 Results 

3.1 Effect of Density 

In both the "noise-free" and "noise far" conditions, there was a significant effect of den- 
sity, consistent with previous work on SBF. All tests were performed using a two-tailed 
t-test for independent samples with equal variances; all conditions for using these tests 
were met. Average performance was significantly higher at density 200 than at density 
100 (all t’s(18)>4.1, all p’s<0.001), and significantly higher at density 400 than at den- 
sity 200 (all t’s(18)>2.1, all p’s<0.05). There was no effect of density for the noise-near 
condition - performance was at chance (the performance level expected from blind 
guessing) at all density levels. This is almost certainly due to a floor effect, meaning 
that improvements in performance would probably have been observed had the task 
been easier or the number of choices greater. 
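
For readers who want to reproduce this kind of analysis, the sketch below computes the two statistics used in this section: a two-sample t statistic with pooled (equal) variances, and a one-sample t statistic against chance. It is a minimal illustration, not the authors' analysis code; with ten accuracy scores per condition it gives the 18 and 9 degrees of freedom reported in the text. Chance performance in a ten-alternative forced-choice task is taken to be 0.1.

```python
import math

CHANCE = 1.0 / 10.0  # ten-alternative forced choice


def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return t, df


def one_sample_t(a, mu0=CHANCE):
    """One-sample t statistic against a reference mean; returns (t, df)."""
    n = len(a)
    mean = sum(a) / n
    var = sum((x - mean) ** 2 for x in a) / (n - 1)
    return (mean - mu0) / math.sqrt(var / n), n - 1
```

Converting t to a p-value requires the Student t distribution, which is available in standard statistics packages.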



3.2 Effect of Noise 

At no density level was mean performance in the "noise near" condition significantly 
different from chance (all t’s(9)<=1.8, all p’s>0.1). Moreover, mean performance in the 
"noise near" condition differed significantly from mean performance in the "noise-free" 
condition at all density levels (all t’s(18)>=2.60, all p’s<0.05). At the two higher den- 
sity levels, mean performance in the "noise near" condition also differed significantly 
from mean performance in the "noise far" condition (t(18)=7.19, p<0.001 at density
200; t(18)=12.38, p<0.001 at density 400). At a density of 100, the difference was not
significant (t(18)=1.80, p>0.05 (n.s.)), but performance in the "noise far" condition did
vary significantly from chance at this density level (t(9)=2.75, p>0.01). 

The mean accuracies in the "noise-free" and "noise far" conditions were not signif- 
icantly different at any density level (all t’s(18)<=0.7, all p’s>0.2). 
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Fig. 3. Shape identification accuracy plotted as a function of element density for the three condi- 
tions. Error bars represent the standard error. The dotted line represents chance performance. 



4 Conclusions and Discussion 



At all density levels, shape identification performance was reduced to chance when spa- 
tiotemporal noise was placed near the figure, whereas it was unaffected by noise placed 
far away. The fact that the dynamic noise in the “noise-near” condition prevented accu- 
rate shape recognition suggests that the spurious appearances and disappearances were 
being treated as dynamic occlusion signals. This would, according to the model, impair 
LES recovery and prevent shape perception. Since shape recognition performance was 
at chance level in the present experiment, it is possible that the presence of dynamic 
noise prevented SBF from occurring at all. That is, the low signal-to-noise ratio may 
be a signal that the entire SBF process should not be performed. In Shipley and Kell- 
man’s [5] experiment with dynamic noise, however, the noise merely reduced recogni- 
tion accuracy. Since the same shapes, task, density levels, and shape velocity were used 
in both experiments, the differences in shape recognition performance are probably due 
to the differences in the dynamic noise. The dynamic noise patch was much denser in 
the present experiment, and was focused around the shape itself. Thus, it seems that the 
individual noise signals are being integrated with the dynamic occlusion transformations,
which produces LES’s that are incompatible with the true shape of the moving
figure, which in turn leads to failures in the subsequent global form reconstruction. This 
suggests that one might use dynamic noise to carefully probe the exact characteristics 
of LES recovery. For example, one might vary the location, density, or distribution of 
noise to precisely determine the spatial and/or temporal integration windows, element 
grouping processes, or global form reconstruction mechanism of SBF. 

It should be possible to implement an iterative consistency filter to remove at least 
some of the inconsistent LES’s, reducing the sensitivity of SBF to dynamic noise. Al- 
though the human visual system does not seem to employ such a filter, machine vision 
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implementations of SBF (such as that by Cunningham et al. [7, 8]) might benefit from 
such a filter. 

The insensitivity of SBF to the velocity of the dynamic noise patch in the noise- 
far condition confirms Cunningham et al.’s [12] claim that only global motion seen as 
coming from a surface’s texture affects SBF. The insensitivity of SBF to the spurious 
transformations produced by a distant dynamic noise patch provides evidence that SBF is
a strictly local process. This finding places convenient restrictions on the LES recovery
stage, and eases the computational overhead that would be involved in an iterative
filter to remove inconsistent LES’s. 

Perceptually, it was clear in these displays that the dynamic noise patch was moving 
coherently as a whole, yet this global motion information did not seem to help SBF. It 
is possible that the global motion pattern did help, but that this positive contribution 
was outweighed by the detrimental effect of the spurious flickering of the noise patch. 
Another interesting possibility is that the improvement in SBF produced by adding a 
coherent surface texture found by Cunningham et al. [12] was not due to the motion of 
the surface texture, but to the motion of the surface texture elements. Since the individual 
elements in the noise patches did not move, there was no motion to disturb SBF (static 
element fields imposed on dynamically defined figures do not affect SBF very strongly 
[14]). 

The results presented here confirm some previously untested predictions of Cun- 
ningham et al.’s model of SBF and provide additional constraints on potential compu- 
tational implementations of SBF. It seems that SBF is a robust method for extracting 
most properties of a moving object from very sparse information while making few 
assumptions about the structure of the world. 
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Abstract. We address the difficulty of image segmentation methods based on the 
popular level set framework to handle an arbitrary number of regions. While in the 
literature some level set techniques are available that can at least deal with a fixed
number of regions greater than two, there is very little work on how to optimise the
segmentation also with regard to the number of regions. Based on a variational 
model, we propose a minimisation strategy that robustly optimises the energy in a 
level set framework, including the number of regions. Our evaluation shows that 
very good segmentations are found even in difficult situations. 



1 Introduction 

Image segmentation has a long tradition as one of the fundamental problems in computer 
vision. Relatively early, the problem has been formalised by Mumford and Shah as the 
minimisation of an energy functional that penalises deviations from smoothness within 
regions and the length of their boundaries [13]. Later, Zhu and Yuille found out that 
this formulation is closely related to the minimum description length criterion and the 
maximum a-posteriori criterion [22]. They presented a new energy functional that unified
many of the existing approaches on image segmentation. It can be interpreted as the joint 
minimisation of the boundary length (as in the Mumford-Shah functional) and the Bayes 
error in the regions’ interior. This is based on the fact that segmentation is actually a 
clustering problem with a neighbourhood constraint. Since penalising the Bayes error 
is optimal from the statistical point of view, the variational formulation of Zhu-Yuille
describes the segmentation problem very accurately. 

However, a tricky issue in image segmentation is the representation of regions and their
boundaries. Although there exist neat energy functionals like the one of Mumford-Shah
or that of Zhu-Yuille, it is not easy to minimise them in practice. A very nice tool to deal
with this problem appeared with the introduction of level sets [8,14]. One application 
to image segmentation has been the active contour model [3,4,10], which is completely 
edge based, and therefore a rather local approach to image segmentation. Level set based 
segmentation that takes the region information into account has been proposed later in 
[15] and [5]. Using level sets for image segmentation has many advantages. First of all, 
level sets yield a nice representation of regions and their boundaries on the pixel grid 

* We gratefully acknowledge partial funding by the Deutsche Forschungsgemeinschaft (DFG) 
and many interesting discussions with Mikael Rousson from INRIA Sophia-Antipolis. 
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without the need of complex data structures. This considerably simplifies optimisation, 
as variational methods and standard numerics can be employed. Furthermore, level sets 
can describe topological changes in the segmentation, i.e. parts of a region can split 
and merge. Finally, the possibility to describe the image segmentation problem with 
a variational model increases the flexibility of the model and allows to employ, for 
instance, additional features [1], shape knowledge [11,7], or joint motion estimation and
segmentation [6]. 

The main problem of the level set representation lies in the fact that a level set function 
is restricted to the separation of two regions. As soon as more than two regions are 
considered, the level set idea loses part of its attractiveness. This is why only a few
papers focus on level set based segmentation in the case of more than two regions. In 
[21], a level set function is assigned to each region. This framework has been adapted 
to classification in [18]. In another approach, the bi-modal case is extended to tri-modal 
segmentation [20]. Both techniques, however, assume an initially fixed number of re- 
gions. This assumption is omitted in [16] where the number of regions is estimated in a 
preliminary stage by means of a Gaussian mixture estimate of the image histogram. This 
way, the number of mixture coefficients determines the number of regions. However, this 
kind of estimation is only loosely connected to the energy functional that is minimised. 
A considerably different approach is proposed in [19]. Here, the level set functions are 
used in such a way that N regions are represented by only $\log_2 N$ level set functions.
Unfortunately, this will result in empty regions if fewer than N regions are present in the
image. These empty regions have undefined statistics, though the statistics still appear 
in the evolution equations. 

Altogether, the prominence of level set based segmentation is yet lost as soon as more 
than two regions come into play, and other segmentation methods based for instance on 
algebraic multigrid [9] often perform better. The purpose of this paper is to solve the 
remaining problem of the level set framework while saving its advantages. 

We show a way how to minimise the energy of Zhu-Yuille by means of level sets. This 
includes also the minimisation with regard to the number of regions. As the objective 
function can be assumed to have plenty of local minima, we employ multi-scale ideas 
and a divide-and-conquer strategy. The most precarious part of the segmentation, namely 
the determination of the number of regions as well as the initialisation of the level set 
functions, is based on the very robust two-region segmentation which splits a domain 
into two parts in a way that is optimal according to the energy (Section 2). The multi- 
phase level set evolution has then just to adapt the regions in the global scope with 
more than two regions present (Section 3). With this minimisation strategy the level set 
framework can be fully exploited, which leads to excellent segmentation results. This will
be demonstrated in some experiments in Section 4. 

2 Two-Region Segmentation 

Contrary to the general segmentation problem, two-region segmentation by means of a 
level set framework is well understood. Consider the Bayes error, i.e. the probability of 
misclassified pixels
with the probability densities $p_1 = p(x|\Omega_1)$ and $p_2 = p(x|\Omega_2)$ of the regions $\Omega_1$ and
$\Omega_2$, and under the side conditions $\Omega = \Omega_1 \cup \Omega_2$ and $\Omega_1 \cap \Omega_2 = \emptyset$, i.e. the regions cover
the whole image domain $\Omega$ and do not overlap. The a-priori probabilities of both regions
are equal, so $P_1 = P_2 = 0.5$. Moreover, instead of minimising the Bayes error directly,
it is beneficial from the numerical point of view to work on the logarithms. Together
with a penalty on the length of the boundary $\Gamma$, weighted by the parameter $\nu$, this leads
to the energy functional



$$E(\Omega_1, \Omega_2, p_1, p_2) = -\int_{\Omega_1} \log p_1 \, dx - \int_{\Omega_2} \log p_2 \, dx + \nu \int_{\Gamma} ds. \qquad (2)$$

For minimising this energy, a level set function is now introduced. Let $\Phi : \Omega \to \mathbb{R}$ be
the level set function with $\Phi(x) > 0$ if $x \in \Omega_1$ and $\Phi(x) < 0$ if $x \in \Omega_2$. The zero-level
line of $\Phi$ is the sought boundary between the two regions. We also introduce the
regularised Heaviside function $H(s)$ with $\lim_{s \to -\infty} H(s) = 0$, $\lim_{s \to \infty} H(s) = 1$, and
$H(0) = 0.5$. This allows us to rewrite Eq. 2 as



$$E(\Phi, p_1, p_2) = -\int_{\Omega} \left( H(\Phi) \log p_1 + (1 - H(\Phi)) \log p_2 - \nu |\nabla H(\Phi)| \right) dx. \qquad (3)$$

The minimisation with respect to the regions can now be performed according to the 
gradient descent equation 

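$$\partial_t \Phi = H'(\Phi) \left( \log \frac{p_1}{p_2} + \nu \, \mathrm{div} \left( \frac{\nabla \Phi}{|\nabla \Phi|} \right) \right) \qquad (4)$$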

where $H'(s)$ is the derivative of $H(s)$ with respect to its argument. Note that the side
conditions are automatically satisfied due to the level set representation. 
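
The text does not say which regularised Heaviside function is used. As an illustration only, the sketch below uses an arctangent-based regularisation that is common in the level set literature; it satisfies the three stated properties (limits 0 and 1, and value 0.5 at zero). The width parameter `eps` is an assumption.

```python
import math


def heaviside(s, eps=1.0):
    """Regularised Heaviside: tends to 0 for s -> -inf, 1 for s -> +inf, H(0) = 0.5."""
    return 0.5 * (1.0 + (2.0 / math.pi) * math.atan(s / eps))


def heaviside_prime(s, eps=1.0):
    """Derivative of `heaviside` with respect to s (a smooth bump centred at 0)."""
    return (1.0 / math.pi) * eps / (eps * eps + s * s)
```

Because the derivative is largest near the zero level line, the evolution in Eq. 4 mainly acts on pixels close to the current boundary.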

However, the probability densities $p_1$ and $p_2$ still have to be estimated. This is done
according to the expectation-maximisation principle. Having the level set function ini- 
tialised with some partitioning, the probability densities can be computed by a nonpara- 
metric Parzen density estimate using the smoothed histogram of the regions. Then the 
new densities are used for the level set evolution, leading to a further update of the prob- 
ability densities, and so on. This iterative process converges to the next local minimum, 
so the initialisation matters. 
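
A minimal sketch of the density estimation step, under simplifying assumptions: grey values are integers in [0, 256), the Parzen estimate is approximated by a Gaussian-smoothed histogram, and the smoothing width `sigma` is an arbitrary illustrative value. The function names are hypothetical. The second function evaluates the data term $\log(p_1/p_2)$ that drives the evolution in Eq. 4.

```python
import math


def parzen_density(values, n_bins=256, sigma=8.0):
    """Gaussian-smoothed, normalised histogram of integer values in [0, n_bins).

    `values` must be non-empty.
    """
    hist = [0.0] * n_bins
    for v in values:
        hist[v] += 1.0
    radius = int(3 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed = []
    for b in range(n_bins):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            if 0 <= b + k < n_bins:
                acc += w * hist[b + k]
        smoothed.append(acc)
    total = sum(smoothed)
    return [s / total for s in smoothed]


def log_ratio(v, p1, p2, floor=1e-12):
    """Data term log(p1(v) / p2(v)); positive values favour region 1."""
    return math.log(max(p1[v], floor)) - math.log(max(p2[v], floor))
```

In the iterative scheme, the densities would be re-estimated from the pixels currently inside and outside the zero level line after each evolution step.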

In order to attenuate this dependency on the initialisation, two measures are advisable.
Firstly, the initialisation should be far from a possible segmentation of the image,
as this enforces the search for a minimum in a more global scope. We always use an 
initialisation with many small rectangles scattered across the image domain. 

The second measure is the application of a coarse-to-fine strategy. Starting with a
downsampled image, there are fewer local minima, so the segmentation is more robust. The
resulting segmentation can then be used as initialisation for a finer scale, until the orig- 
inal optimisation problem is solved. 
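
The coarse-to-fine idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: images are plain 2D lists, each coarser level halves the resolution by 2x2 averaging, and `segment(image, init_labels)` is a hypothetical callback standing in for the level set segmentation at one scale.

```python
def downsample(image):
    """Halve the resolution by averaging 2x2 blocks (odd trailing row/column dropped)."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * y][2 * x] + image[2 * y][2 * x + 1]
              + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def upsample_labels(labels, height, width):
    """Nearest-neighbour upsampling of a coarse label map to (height, width)."""
    ch, cw = len(labels), len(labels[0])
    return [[labels[min(y // 2, ch - 1)][min(x // 2, cw - 1)]
             for x in range(width)] for y in range(height)]


def coarse_to_fine(image, segment, levels=3):
    """Segment the coarsest level first and reuse each result as the next initialisation."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    labels = None
    for img in reversed(pyramid):  # coarsest level first
        if labels is not None:
            labels = upsample_labels(labels, len(img), len(img[0]))
        labels = segment(img, labels)
    return labels
```

The image must be large enough that the coarsest level still contains at least one pixel.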

Under the assumption of exactly two regions in the image, this framework works very 
well. For some nice results obtained with this method we refer to [17,1]. The only
remaining problem is that the assumption of exactly two regions in an image
usually does not hold.







T. Brox and J. Weickert 



3 Multiple Region Segmentation 



For the reasons mentioned above, the generalised version of the segmentation problem
with an arbitrary number of regions N will now be considered. The general model is
described by the energy of Zhu-Yuille [22]

$$E(\Omega_i, p_i, N) = \sum_{i=1}^{N} \left( -\int_{\Omega_i} \log p_i \, dx + \frac{\nu}{2} \int_{\Gamma_i} ds + \lambda \right) \qquad (5)$$



The additional term of this energy functional penalises the number of regions with the
parameter $\lambda$. Now also the number of regions is a free variable that has to be optimised.
Moreover, this variable is discrete, and with an increased number of regions the minimisation
is very sensitive to different initialisations. Furthermore, the nice splitting into two regions by a single
level set function as described in the last section is not applicable anymore. 

Reduced problem with N regions. In order to cope with all these additional difficulties, 
the complexity of the problem is first reduced by setting N fixed and assuming that a 
reasonable initialisation of the regions is available. In this case it is possible to introduce 
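again a level set based energy functional with a set of level set functions $\Phi_i$, each
representing one region as $\Phi_i(x) > 0$ if and only if $x \in \Omega_i$.

$$E(\Phi_i, p_i) = \sum_{i=1}^{N} \left( -\int_{\Omega} \left( H(\Phi_i) \log p_i - \frac{\nu}{2} |\nabla H(\Phi_i)| \right) dx \right) \qquad (6)$$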



Note that, in contrast to the two-region case, this formulation no longer implicitly respects
the side condition of disjoint regions. Minimising the energy according to the
expectation-maximisation principle and the following evolution equations 

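$$\partial_t \Phi_i = H'(\Phi_i) \left( \log p_i - \max_{j \neq i} \log p_j + \frac{\nu}{2} \, \mathrm{div} \left( \frac{\nabla \Phi_i}{|\nabla \Phi_i|} \right) \right) \qquad (7)$$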



ensures the adherence to the side conditions at least for the statistical part, since the 
maximum a-posteriori criterion ensures that a pixel is assigned uniquely to the region 
with the maximum a-posteriori probability. The smoothness assumption, however, can 
result in slight overlapping of regions close to their boundaries, like in all existing level 
set based methods dealing with an arbitrary number of regions, except [19]. If this is not
wanted in the final result, the pixels of such overlapping areas can be assigned to the
region where the level set function attains its maximum value.
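
The overlap resolution just described amounts to a per-pixel argmax over the level set functions. The sketch below is a minimal illustration with level set functions stored as 2D lists; for simplicity it applies the rule to every pixel, which also assigns pixels that are currently claimed by no region.

```python
def resolve_overlaps(phis):
    """Assign each pixel to the region whose level set function is largest there.

    phis: list of N level set functions, each a 2D list of identical shape.
    Returns a 2D list of region indices in [0, N).
    """
    n_regions = len(phis)
    height, width = len(phis[0]), len(phis[0][0])
    return [[max(range(n_regions), key=lambda i: phis[i][y][x])
             for x in range(width)] for y in range(height)]
```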

So up to this point we can handle the following two cases: 

- A domain of the image can be split into two parts by the two-region segmentation 
framework. 

- A set of regions can evolve, minimising the energy in Eq. 5, if the number of regions 
is fixed and reasonable initialisations for the regions are available. 

Fig. 1. Segmentation of two artificial texture images: In both cases 4 regions were detected.

Fig. 2. Segmentation of a texture image: 5 regions have been detected.

Solving the general problem. By means of these two special cases, also the general
problem according to the model in Eq. 5 can be solved. Starting with the whole image
domain $\Omega$ being a single region, the two-region segmentation can be applied in order
to find the best splitting of the domain. If the energy decreases by the splitting, this
results in two regions. On these regions, again the two-region splitting can be applied,
and so on, until the energy does not decrease by further splits anymore. With this
procedure, not only the optimum number of regions is determined, but also suitable
initialisations for the regions. Of course, the resulting partitioning is not optimal yet,
as for the two-region splitting, possibilities of a region to evolve have been ignored.
However, as the region number and the initialisation are known, the energy can now be
minimised in the global scope by applying the evolution of Eq. 7, adapting the regions
to the new situation where they have more competitors.
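
The recursive splitting can be illustrated on a deliberately simplified toy problem. The sketch below is not the level set method: it works on a 1D signal, models each region by a Gaussian, replaces the two-region level set segmentation with an exhaustive search for the best single cut, and drops the boundary length term. What it shares with the procedure above is the control flow: a split is kept only if it lowers the total energy, which includes a penalty `lam` per region as in Eq. 5. The subsequent joint evolution according to Eq. 7 is omitted.

```python
import math


def region_energy(values, lam, eps=1e-6):
    """Gaussian negative log-likelihood (up to constants) plus a per-region penalty."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return 0.5 * n * math.log(var + eps) + lam


def best_split(values, lam):
    """Toy stand-in for two-region segmentation: best single cut of a 1D segment."""
    best = None
    for cut in range(1, len(values)):
        energy = region_energy(values[:cut], lam) + region_energy(values[cut:], lam)
        if best is None or energy < best[0]:
            best = (energy, cut)
    return best  # None if the segment has fewer than two samples


def recursive_split(values, lam, start=0):
    """Return (start, end) segments; split only while the energy decreases."""
    split = best_split(values, lam)
    if split is None or split[0] >= region_energy(values, lam):
        return [(start, start + len(values))]
    cut = split[1]
    return (recursive_split(values[:cut], lam, start)
            + recursive_split(values[cut:], lam, start + cut))
```

A larger `lam` makes splits more expensive and therefore yields fewer regions.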

This procedure is applied in a multi-scale setting. Starting the procedure as described on 
the coarsest scale, with every refinement step on the next finer scale, it is checked whether 
any further splitting or merging decreases the energy before the evolution according to 
Eq. 7 is applied. So for each scale the optimum N is updated, as well as the region 
boundaries and the region statistics. 

Though a global optimum still cannot be guaranteed (see footnote 1), this kind of minimisation
quite reliably avoids being trapped in far-away local minima, as it applies both a coarse-to-fine
strategy and the divide-and-conquer principle. The two-region splitting completely
ignores the cluttering rest of the image. This consistently addresses the problems of 
optimising the discrete variable N and of not knowing good initialisations for the regions. 

1 This will only be the case, if the simplified objective function at the coarsest scale is unimodal 
and the global optimum of each next finer scale is the optimum closest to the global optimum 
at the respective coarser scale. 
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Fig. 3. Segmentation of a leopard image (colour): 3 regions have been detected. 




Fig. 4. Segmentation of a penguin image (colour): 3 regions have been detected. 



4 Results 

We evaluated this scheme with a couple of artificial and real-world images. In order 
to handle texture and colour images, the features were computed and incorporated as 
described in [1]. We also used the local scale measure proposed in [2] as an additional
texture feature. 

As Fig. 1 reveals, the method works fine for the artificial texture images. The optimum 
number of regions has been detected. The same holds for the test image depicted in 
Fig. 2, which is often used in the literature, e.g. in [9]. Real-world images are often
much more difficult. Comparing, however, the segmentation result of the penguin image
in Fig. 4 to the result in [12] shows that our method is competitive with other well-known
methods. While in [12] 6 regions have been detected, the 3 regions found by our method 
are more reasonable. Our level set framework also compares favourably to the algebraic 
multigrid method in [9], as can be observed by means of the difficult squirrel image in 
Fig. 5a. Also Fig. 5b and Fig. 6 show an almost perfect segmentation. 

It should be noted that all parameters that appear in the method have been set to fixed 
values, so all results shown here have been achieved with the same parameters. This 
is important, as of course it is much easier to obtain good segmentation results, if the 
parameters are tuned for each specific image. However, we think that this contradicts 
somehow the task of unsupervised segmentation. 

The algorithm is reasonably fast. The 169 x 250 koala image took 22.5 seconds on an 
Athlon XP 1800+ including feature computation.
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Fig. 5. Left: (a) Segmentation of a squirrel image: 2 regions have been detected. Right: (b) 
Segmentation of a koala image (colour): 4 regions have been detected. 




Fig. 6. Segmentation of a castle image (colour): 3 regions have been detected. 



5 Summary 

In this paper we proposed a level set based minimisation scheme for the variational 
segmentation model of Zhu-Yuille. While the popular level set framework has so far only 
been used for two-region segmentation or segmentation with a fixed number of regions, 
we described a way to optimise the result also with regard to the number of regions.
Moreover, the divide-and-conquer principle provides good initialisations, so the method 
is less sensitive to local minima than comparable methods. All advantages of the level 
set framework are preserved, while its main problem has been solved. The performance 
of the variational model and its minimisation strategy has been demonstrated in several 
experiments. It compares favourably to existing approaches from the literature. 
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Abstract. Compressed domain image retrieval allows image indexing 
to be performed directly on the compressed data without the need of 
decoding. This approach hence provides a significant gain in terms of 
speed and also eliminates the need to store feature indices. In this paper 
we introduce a compressed domain image retrieval technique based 
on the Colour Visual Pattern Image Coding (CVPIC) compression 
algorithm. CVPIC represents an image coding technique where the 
compressed form is directly meaningful. Data that is readily available 
includes information on colour and edge (shape) descriptors of image 
subblocks. It is this information that is utilised by calculating a com- 
bined colour and shape histogram. Experimental results on the UCID 
dataset show this novel approach to be both efficient and effective, 
outperforming methods such as colour histograms, colour coherence 
vectors, and colour correlograms. 

Keywords: Image retrieval, compressed domain image retrieval, mid- 
stream content access, CVPIC 



1 Introduction 

With the recent explosion in availability of digital imagery the need for content- 
based image retrieval (CBIR) is ever increasing. While many methods have been 
suggested in the literature only few take into account the fact that - due to 
limited resources such as disk space and bandwidth - virtually all images are 
stored in compressed form. In order to process them for CBIR they first need to 
be uncompressed and the features calculated in the pixel domain. Often these 
features are stored alongside the images which seems counterintuitive to the 
original need for compression. The desire for techniques that operate directly 
in the compressed domain, providing so-called midstream content access, seems
therefore evident [9]. 

Colour Visual Pattern Image Coding (CVPIC) is one of the first so-called 
4-th criterion image compression algorithms [12,11]. A 4-th criterion algorithm
allows - in addition to the classic three image coding criteria of image quality, 
efficiency, and bitrate - the image data to be queried and processed directly 
in its compressed form; in other words the image data is directly meaningful 
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without the requirement of a decoding step. The data that is readily available in 
CVPIC compressed images is the colour information of each of the 4x4 blocks 
the image has been divided into, and information on the spatial characteristics 
of each block, including whether a given block is identified as a uniform block 
(a block with no or little variation) or a pattern block (a block where an edge 
or gradient has been detected). 

In this paper we make direct use of this information and propose an image 
retrieval algorithm that allows for image retrieval directly in the compressed 
domain of CVPIC. Since both colour and shape (edge) information is precalcu- 
lated and readily available in the CVPIC domain, a simple combined histogram 
of these can be obtained very efficiently. Exploiting these histograms allows for 
image retrieval based on both colour and shape contents. Experimental results 
obtained from querying the UCID [14] dataset show that this approach not only 
allows retrieval directly in the compressed domain but also clearly outperforms 
popular techniques such as colour histograms, colour coherence vectors, and 
colour correlograms. 

The rest of this paper is organised as follows: in Section 2 the CVPIC com- 
pression algorithm used in this paper is reviewed. Section 3 describes our novel 
method of image retrieval in the CVPIC domain while Section 4 presents exper- 
imental results. Section 5 concludes the paper. 



2 Colour Visual Pattern Image Coding 

The Colour Visual Pattern Image Coding (CVPIC) image compression algorithm 
introduced by Schaefer et al. [12] is an extension of the work by Chen and 
Bovik [2]. The underlying idea is that within a 4 x 4 image block only one 
discontinuity is visually perceptible. 

CVPIC first performs a conversion to the CIEL*a*b* colour space [3] as a 
more appropriate image representation. Like many other colour spaces, CIEL*a*b* 
comprises one luminance and two chrominance channels; CIEL*a*b*, however, 
was designed to be a uniform representation, meaning that equal differences 
in the colour space correspond to equal perceptual differences. A quantitative 
measurement of these colour differences was defined using the Euclidean distance 
in the L*a*b* space and is given in $\Delta E$ units. 

A set of 14 patterns of 4 x 4 pixels has been defined in [2]. All these patterns 
contain one edge at various orientations (vertical, horizontal, plus and minus 
45°) as can be seen in Figure 1 where + and - represent different intensities. In 
addition, a uniform pattern where all intensities are equal is used. 

Fig. 1. The 14 edge patterns used in CVPIC 

The image is divided into 4x4 pixel blocks. Determining which visual pattern 
represents each block most accurately then follows. For each of the visual 
patterns the average L*a*b* values $\mu_+$ and $\mu_-$ for the regions marked by + and 
- respectively (i.e. the mean values for the regions on each side of the pattern) 
are calculated according to 

$$\mu_+ = \frac{\sum_{i \in +} p_i}{\sum_{i \in +} 1} \quad \text{and} \quad \mu_- = \frac{\sum_{j \in -} p_j}{\sum_{j \in -} 1} \tag{1}$$

where $p_i$ and $p_j$ represent the pixel vectors in L*a*b* colour space. 

The colour difference of each actual pixel and the corresponding mean value 
is obtained and averaged over the block according to 

$$\epsilon = \frac{\sum_{i \in +} \|p_i - \mu_+\| + \sum_{j \in -} \|p_j - \mu_-\|}{16} \tag{2}$$

The visual pattern leading to the lowest $\epsilon$ value (given in CIEL*a*b* $\Delta E$ 
units) is then chosen. In order to allow for the encoding of uniform blocks the 
average colour difference to the mean colour of the block is also determined 
according to 

$$\sigma = \frac{\sum_{\forall i} \|p_i - \mu\|}{16} \quad \text{where} \quad \mu = \frac{\sum_{\forall i} p_i}{16} \tag{3}$$

A block is coded as uniform if either its variance in colour is very low, or if the 
resulting image quality will not suffer severely when coded as a uniform rather 
than as an edge block. To meet this requirement two thresholds are defined. 
The first threshold describes the upper bound for variations within a block, i.e. 
the average colour difference to the mean colour of the block. Every block with 
a variance below this value will be encoded as uniform. The second threshold 
is related to the difference between the average colour variation within a block 
and the average colour difference that would result if the block were coded as 
a pattern block (i.e. the lowest variance possible for an edge block) which is 
calculated by 

$$\delta = \sigma - \min_{\forall\,\text{patterns}}(\epsilon) \tag{4}$$

If this difference is very low (or if the variance for a uniform pattern is below 
those of all edge patterns in which case $\delta$ is negative) coding the block as uniform 
will not introduce distortions much more perceptible than if the block is coded 
as a pattern block. Hence, a block is coded as a uniform block if at least one of 
the following criteria is met: 

(i) $\sigma < 1.75$ 

(ii) $\delta < 1.25$ 

We adopted the values of 1.75 $\Delta E$ and 1.25 $\Delta E$ for the two thresholds from [12]. 
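
The block classification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 14 edge patterns of Figure 1 are not reproduced in the text, so they are passed in as a hypothetical list `masks` of 4x4 boolean arrays (True for the + region), and the function names are ours, not those of the original implementation.

```python
import numpy as np

SIGMA_MAX = 1.75  # threshold (i), in Delta-E units, as quoted in the text
DELTA_MAX = 1.25  # threshold (ii), in Delta-E units, as quoted in the text


def pattern_error(block, mask):
    """Average colour difference (Eq. 2) of a 4x4 L*a*b* block for one pattern.

    block: float array of shape (4, 4, 3).
    mask:  bool array of shape (4, 4), True for '+' pixels, False for '-'.
    """
    mu_plus = block[mask].mean(axis=0)     # Eq. 1, '+' region
    mu_minus = block[~mask].mean(axis=0)   # Eq. 1, '-' region
    err = np.linalg.norm(block[mask] - mu_plus, axis=1).sum()
    err += np.linalg.norm(block[~mask] - mu_minus, axis=1).sum()
    return err / 16.0


def classify_block(block, masks):
    """Return ('uniform', None) or ('pattern', index of the best mask)."""
    pixels = block.reshape(-1, 3)
    # Eq. 3: mean distance of the 16 pixels to the block's mean colour.
    sigma = np.linalg.norm(pixels - pixels.mean(axis=0), axis=1).mean()
    errors = [pattern_error(block, m) for m in masks]
    best = int(np.argmin(errors))
    delta = sigma - errors[best]           # Eq. 4
    if sigma < SIGMA_MAX or delta < DELTA_MAX:
        return "uniform", None
    return "pattern", best
```

A block with a sharp vertical colour step is then classified as a pattern block with the vertical mask, while a constant block is classified as uniform.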

For each block, one bit is stored which states whether the block is uniform 
or a pattern block. In addition, for edge blocks an index identifying the visual 
pattern needs to be stored. Following this procedure results in a representation 
of each block as 5 bits (1 + 4 as we use 14 patterns) for an edge block and 1 
bit for a uniform block describing the spatial component, and the full colour 
information for one or two colours (for uniform and pattern blocks respectively). 

In contrast to [12] where each image is colour quantised individually, the 
colour components are quantised to 64 universally pre-defined colours (we 
adopted those of [10]). Each colour can hence be encoded using 6 bits. There- 
fore, in total a uniform block takes 7 (=1 + 6) bits, whereas a pattern block is 
stored in 17 (=5 + 2*6) bits. We found that this yielded an average compression 
ratio of about 1:30. We note that the information could be further encoded to 
achieve lower bitrates. Both the pattern and the colour information could be en- 
tropy coded. In this paper however, we refrain from this step as we are primarily 
interested in a synthesis of coding and retrieval. 
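
The bit budget above can be checked with a short calculation. The sketch below assumes 24-bit RGB source images (an assumption on our part, the text does not state the source bit depth) and a hypothetical helper `compression_ratio` that takes the share of uniform blocks.

```python
BLOCK_PIXELS = 4 * 4
RAW_BITS_PER_PIXEL = 24                  # assumption: 8 bits per RGB channel
COLOUR_BITS = 6                          # 64 palette colours
UNIFORM_BITS = 1 + COLOUR_BITS           # block-type flag + one colour = 7
PATTERN_BITS = 1 + 4 + 2 * COLOUR_BITS   # flag + pattern index + two colours = 17


def compression_ratio(uniform_fraction):
    """Raw bits divided by coded bits for a given share of uniform blocks."""
    coded = (uniform_fraction * UNIFORM_BITS
             + (1.0 - uniform_fraction) * PATTERN_BITS)
    return BLOCK_PIXELS * RAW_BITS_PER_PIXEL / coded
```

Under this assumption the ratio lies between roughly 22.6 (only pattern blocks) and 54.9 (only uniform blocks); a ratio of 30 would correspond to about 42% uniform blocks.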



3 CVPIC Image Retrieval by Colour and Shape 

We note from above that for each image block in CVPIC both colour and edge 
information is readily available in the compressed form: each block contains 
either one or two colours and belongs to one of 15 edge classes. We propose 
to make direct use of this information for the purpose of image retrieval. In a 
sense our approach is similar to the work by Jain and Vailaya [6] where image 
retrieval is performed based on colour and shape (edge) information. However, 
our method differs in two important aspects. In stark contrast to their work, our 
method runs directly in the compressed domain without any further need for 
calculating these descriptors. Furthermore, due to the low dimensionality of our 
features we are able to build a combined colour and shape histogram rather than 
two separate descriptors that need to be re-integrated in the retrieval process. 

It is well known that colour is an important cue for image retrieval. In fact, 
simple descriptors such as histograms of the colour contents of images [16] have 
been shown to work well and have hence been used in many CBIR systems such 
as QBIC [7] or Virage [1]. A colour histogram is built by (uniformly) quantis- 
ing the colour space into a number of bins (often 8x8x8) and counting how 
many pixels of the image fall into each bin. From the description of the CVPIC 
algorithm it can be easily deduced how a colour histogram can be efficiently cal- 
culated there. First, CVPIC colour histograms need only 64 entries since there 
are only 64 colours in the palette used during the encoding. This in turn means 
that the dimensionality is much lower compared to traditional colour histograms 
which again implies that the comparison of these histograms requires fewer com- 
putations. Since each block contains one or two colour indices and an edge index 
an exact colour histogram can be calculated by weighting the respective two 
colours by the number of pixels they occupy. While this method requires fewer 
computations than are needed for obtaining histograms in the pixel domain we 
propose a yet more efficient approach. Instead of applying weights according to 
the layout of each pattern we simply increment the relevant histogram bins for 
each block 1. 

1 We note that by doing so we put more emphasis on the colour content of edge blocks 
compared to uniform blocks. 







While image retrieval based on colour usually produces useful results, in- 
tegration of this information with another paradigm such as texture or shape 
will result in an improved retrieval performance. Shape descriptors are often cal- 
culated as statistical summaries of local edge information such as in [6] where 
the edge orientation and magnitude is determined at each pixel location and an 
edge histogram calculated. Exploiting the CVPIC image structure an effective 
shape descriptor can be determined very efficiently. Since each (pattern) block 
contains exactly one (precalculated) edge and there are 15 different patterns a 
simple histogram of the edge indices could be built. However, since both colour 
and shape features are of low dimensionality we propose to integrate them into 
a combined colour/shape histogram rather than building two separate descrip- 
tors as in [6]. We further reduce the dimensionality by considering only 5 edge 
classes: horizontal and vertical edges, edges at plus and minus 45°, and no edge 
(uniform blocks). Thus, we end up with a 64 x 5 colour/shape histogram $H_{CS}(I)$ 
for an image $I$: 

$$\begin{aligned}
H_{CS}(I)(i,1) &= \Pr((c_1 = i \lor c_2 = i) \land p \in \{1,2,3\}) && \text{horizontal} \\
H_{CS}(I)(i,2) &= \Pr((c_1 = i \lor c_2 = i) \land p \in \{4,5,6\}) && \text{vertical} \\
H_{CS}(I)(i,3) &= \Pr((c_1 = i \lor c_2 = i) \land p \in \{7,8,9,10\}) && -45^\circ \\
H_{CS}(I)(i,4) &= \Pr((c_1 = i \lor c_2 = i) \land p \in \{11,12,13,14\}) && +45^\circ \\
H_{CS}(I)(i,5) &= \Pr(c_1 = i \land p = 15) && \text{uniform}
\end{aligned} \tag{5}$$

where $c_1$, $c_2$, and $p$ are the colour and pattern indices (the patterns are numbered 
according to Figure 1, going from left to right, top to bottom) of a block. 

It should be pointed out that these CVPIC colour/shape histograms $H_{CS}(I)$ 
can be created extremely efficiently. In essence, per 4x4 image block only 1 
addition is needed (to increment the relevant histogram bin). This makes it 
unnecessary to store any information alongside the image as the indices can 
be created online with hardly any overhead to reading the image file. As such, 
the method lends itself to online retrieval, e.g. of the web, which - due to 
the dynamic structure of the Internet - is impossible to achieve with traditional 
index based approaches. 

Two CVPIC colour/shape histograms $H_{CS}(I_1)$ and $H_{CS}(I_2)$ obtained from 
images $I_1$ and $I_2$ are compared using the histogram intersection measure 
introduced in [16] 

$$s_{CS}(I_1, I_2) = \sum_{i=1}^{64} \sum_{j=1}^{5} \min(H_{CS}(I_1)(i,j), H_{CS}(I_2)(i,j)) \tag{6}$$

which provides a similarity score between 0 and 1 (for normalised histograms). 
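
As an illustration, the combined histogram and its comparison could be computed as follows. This is a minimal sketch, not the original implementation: the block representation `(c1, c2, p)` with 0-based colour indices, the function names, and the normalisation of the histogram to unit sum are all assumptions made here so that the intersection score falls between 0 and 1.

```python
import numpy as np

N_COLOURS = 64
N_CLASSES = 5
# Pattern index (1..15, numbered as in the text) -> histogram column (0..4).
EDGE_CLASS = {p: 0 for p in (1, 2, 3)}                # horizontal
EDGE_CLASS.update({p: 1 for p in (4, 5, 6)})          # vertical
EDGE_CLASS.update({p: 2 for p in (7, 8, 9, 10)})      # -45 degrees
EDGE_CLASS.update({p: 3 for p in (11, 12, 13, 14)})   # +45 degrees
EDGE_CLASS[15] = 4                                    # uniform


def colour_shape_histogram(blocks):
    """Build a normalised 64 x 5 histogram from decoded block indices.

    blocks: iterable of (c1, c2, p); c2 is None for uniform blocks (p == 15).
    """
    hist = np.zeros((N_COLOURS, N_CLASSES))
    for c1, c2, p in blocks:
        j = EDGE_CLASS[p]
        hist[c1, j] += 1
        if c2 is not None and c2 != c1:
            hist[c2, j] += 1   # second colour of a pattern block
    total = hist.sum()
    return hist / total if total > 0 else hist


def intersection(h1, h2):
    """Histogram intersection (Eq. 6)."""
    return float(np.minimum(h1, h2).sum())
```

Because a pattern block increments two bins and a uniform block only one, this sketch reproduces the extra emphasis on edge blocks mentioned in footnote 1.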

4 Experimental Results 

We evaluated our method using the recently released UCID dataset [14]. UCID 2, 
an Uncompressed Colour Image Database, consists of 1338 colour images all 
preserved in their uncompressed form which makes it ideal for the testing of 
compressed domain techniques. UCID also provides a ground truth of 262 assigned 
query images each with a number of predefined corresponding matches that an 
ideal image retrieval system would return. 

2 The UCID dataset is available from http://vision.doc.ntu.ac.uk/. 

We compressed the database using the CVPIC coding technique and per- 
formed image retrieval using the algorithm detailed in Section 3 based on the 
queries defined in the UCID set. As performance measure we use the modified av- 
erage match percentile (AMP) from [14] and the retrieval effectiveness from [4]. 
The modified AMP is defined as 

$$\mathrm{MP}_Q = \frac{100}{S_Q} \sum_{i=1}^{S_Q} \frac{N - R_i}{N - i} \tag{7}$$

with $R_i < R_{i+1}$ and 

$$\mathrm{AMP} = \frac{1}{Q} \sum_{Q} \mathrm{MP}_Q \tag{8}$$

where $R_i$ is the rank at which the $i$-th match to query image $Q$ was returned, $S_Q$ is the 
number of corresponding matches for $Q$, and $N$ is the total number of images in 
the database. A perfect retrieval system would achieve an AMP of 100 whereas 
an AMP of 50 would mean the system performs as well as one that returns the 
images in a random order. The retrieval effectiveness is given by 

$$\mathrm{RE}_Q = \frac{\sum_{i=1}^{S_Q} R_i}{\sum_{i=1}^{S_Q} I_i} \tag{9}$$

where $R_i$ is the rank of the $i$-th matching image and $I_i$ is the ideal rank of the 
$i$-th match (i.e. $I = \{1, 2, ..., S_Q\}$). The average retrieval effectiveness ARE is 
then taken as the mean of RE over all query images. An ideal CBIR algorithm 
would return an ARE of 1; the closer the ARE is to that value (i.e. the lower the 
ARE) the better the algorithm. 
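
The two performance measures can be written down directly from their definitions. The following is a small sketch with hypothetical function names; ranks are assumed to be 1-based, and averaging over queries is done with a plain mean.

```python
def match_percentile(ranks, n_images):
    """Modified match percentile (Eq. 7) for a single query.

    ranks: 1-based ranks at which the query's matches were returned.
    n_images: total number of images N in the database.
    """
    ranks = sorted(ranks)  # enforces R_i < R_{i+1}
    s_q = len(ranks)
    total = sum((n_images - r) / (n_images - i)
                for i, r in enumerate(ranks, start=1))
    return 100.0 * total / s_q


def retrieval_effectiveness(ranks):
    """Eq. 9: sum of actual ranks over the sum of ideal ranks 1..S_Q."""
    s_q = len(ranks)
    return sum(ranks) / (s_q * (s_q + 1) / 2)


def average(values):
    """Mean over all queries, used for both AMP and ARE."""
    values = list(values)
    return sum(values) / len(values)
```

For a query whose three matches are returned at ranks 1, 2 and 3, the match percentile is 100 and the retrieval effectiveness is 1, the ideal values named in the text.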



Table 1. Results obtained on the UCID dataset. 

Method                              AMP     ARE
Colour histograms                   90.47   90.83
Colour coherence vectors            91.03   85.88
Border/interior pixel histograms    91.27   82.49
Colour correlograms                 89.96   95.61
CVPIC colour & shape                93.70   57.82




In order to relate the results obtained we also implemented colour histogram 
based image retrieval (uniformly quantised 8x8x8 RGB histograms 
with histogram intersection) according to [16], colour coherence vectors [8], 
border/interior pixel histograms [15] and colour (auto) correlograms [5]. Results for 
all methods are given in Table 1. From there we see that our novel approach is 
not only capable of achieving good retrieval performance, but that it clearly 
outperforms all other methods. While the border/interior pixel approach achieves 
an AMP of 91.27 and all other methods perform worse, CVPIC colour/shape 
histograms provide an average match percentile of 93.70, that is 2.43 
higher than the best of the other methods. This is indeed a significant difference 
as a drop in match percentile of about 2.5 will mean that about 2.5% more of the 
whole image database needs to be returned in order to find the images that are 
relevant; as typical image databases nowadays can contain tens of thousands to 
hundreds of thousands of images this would literally mean thousands of additional 
images. The superiority of the CVPIC approach is especially remarkable as it is based 
on images compressed to a medium compression ratio, i.e. images with a 
significantly lower image quality compared to uncompressed images whereas for 
all other methods the original uncompressed versions of the images were used 3. 
Furthermore, methods such as colour histograms, colour coherence vectors and 
colour correlograms are known to work fairly well for image retrieval and are 
hence among those techniques that are widely used in this field. An example of 
the difference in retrieval performance is illustrated in Figure 2 which shows one 
of the query images of the UCID database together with the five top ranked 
images returned by all methods. Only the CVPIC technique manages to retrieve 
four correct model images in the top five while colour correlograms retrieve three 
and all other methods only two. 

Fig. 2. Sample query together with 5 top ranked images returned by (from left to 
right, top to bottom) colour histograms, colour coherence vectors, border/interior pixel 
histograms, colour correlograms, CVPIC retrieval. 

5 Conclusions 

In this paper we present a novel image retrieval technique that operates directly 
in the compressed domain of CVPIC compressed images. By utilising the fact 
that CVPIC encodes both colour and edge information these features can be 
directly exploited for image retrieval by building a combined colour/shape his- 
togram. Experimental results on a medium-sized colour image database show 
that the suggested method performs well, outperforming techniques such as 
colour histograms, colour coherence vectors, and colour correlograms. 

3 Compressing the images to a size similar to the CVPIC images using a standard 
coding technique such as JPEG will result in a further performance drop as has 
been shown in [13]; hence the results presented here are indeed based on a best case 
scenario. 

Abstract. Irregular pyramids organize a sequence of partitions of images 
in such a way that each partition is deduced from the preceding 
one by union of some of its regions. In this paper, we show how a single 
pyramid can be used to encode redundant subparts of different partitions. 
We obtain a pyramid that accounts for the redundancy of the partitions. 
This structure, naturally called the redundancy pyramid, can be used 
for many purposes. We also demonstrate and discuss some applications 
for studying image sequences. 



1 Introduction 

Image segmentation is an important component of many machine vision appli- 
cations such as object recognition and matching for stereo reconstruction. In 
general, segmentation techniques aim to partition an image into connected re- 
gions having homogeneous properties. 

A major issue with segmentation algorithms is their stability. The partitions 
produced by different segmentation algorithms will be to some extent different. 
The same is true when a single segmentation algorithm is applied on an image 
sequence of a static scene under varying illumination. Comparing and merging 
several partitions seems an obvious way to partially solve the problem of stability. 

Several techniques in computer vision and pattern recognition handle several 
partitions of images. A combination of different segmentations to obtain the 
best segmentation of an image has been suggested by Cho and Meer [2] based 
on the cooccurrence probabilities of points in partitions. However, they make 
use of small differences resulting from random processes in the construction of 
a Region Adjacency Graph (RAG) pyramid to generate their segmentations. 
Matching segmentations of different images is usually addressed as a pairwise 
problem, without exploiting the redundancy inherent to highly redundant images. 
Recently, Keselman and Dickinson [4] have proposed a method for computing 
common substructures of RAGs, called the lowest common abstraction. They 
try to find isomorphic graphs obtained from different RAGs by fusing adjacent 
regions. While their approach is attractive, it suffers from a certain number of 
drawbacks. When handling real world segmentations, noise can split or merge 
arbitrary regions and the lowest common abstraction cannot cope with these 
processes. 

* This work was supported by the Austrian Science Foundation (FWF) under grants 
P14445-MAT, P14662-INF and S9103-N04. 

Basically these approaches try to exploit the redundancy of observations, 
which is widely used in robust estimation, and more generally, but implicitly, in 
robust computer vision techniques. In (robust) estimation, redundancy is defined 
as the difference between the number of parameters of a functional model, and its 
number of equations [3]. When the redundancy increases, the computed model 
is not only more precise but also more reliable [3] . 

Our approach is based on basic topology. We exploit the redundant struc- 
tures of topological partitions. The use of this formalism guarantees that the 
proposed theoretical results are independent of the dimension of the space being 
partitioned. In section 2, we propose a set of definitions. After having recalled 
standard definitions in topology, we introduce new basic tools for comparing 
several partitions, the least common multiple and the greatest common 
divisor of partitions, whose definitions and properties are analogous to classical 
definitions on the set of integers. We then propose the definition of a pyramid 
in this framework. In section 3 we propose a fundamental theorem which en- 
ables the definition, based on these concepts, of a structure that plays a key 
role in the comparison of several partitions. We also propose an efficient method 
for constructing an approximation of the redundancy pyramid on a digital im- 
age of dimension 2. In section 4, we propose a proof of concept. The analysis 
of the redundancy of the structure of a segmentation of images in a sequence 
of moving objects in a static background leads to interesting results discussed 
in this section. Very redundant parts are part of a good segmentation of the 
background. Moderately redundant parts are moving objects, with a certain 
tolerance to pauses during the object’s displacement. This leads to a very reliable 
process of background segmentation on image sequences with drastically varying 
illumination. 



2 Basic Definitions 

We recall here basic definitions from topology and propose new definitions that 
will help to define partitions, pyramid of partitions and the redundancy pyramid. 



2.1 Topology 

A topology on a set $E$ is a family $T$ of subsets of $E$ (the “open” subsets of $E$) 
such that a union of elements of $T$ is an element of $T$, a finite intersection of 
elements of $T$ is an element of $T$, and $\emptyset$ and $E$ are elements of $T$. $E$ equipped 
with a topology $T$ is called a topological space. 

A topological space is connected if it cannot be partitioned into two disjoint, 
nonempty open sets. A (topological) subspace $G$ of a topological space $E$ is a 
subset $G$ of $E$ such that the open sets in $G$ are the intersection of the open sets 
of $E$ with $G$. The complement of $F \in T$ is the set $\bar{F} = E \setminus F$. The sets $\bar{F}$ are 
called closed sets. 

The interior $\mathrm{int}(e)$ of a subset $e$ of $E$ is the largest open set contained in $e$. 
The closure $\mathrm{cl}(e)$ of a subset $e$ of $E$ is the smallest closed set containing $e$. The 
boundary of a subset $e$ of $E$ is the intersection of its closure and the closure of 
its complement. 

We call region a closed connected subset $r$ of $E$ such that $\mathrm{int}(r) \neq \emptyset$. We 
define the following relations for regions. If $r \cap r' \neq \emptyset$ and $\mathrm{int}(r \cap r') = \emptyset$ and 
$\mathrm{int}(r \cup r')$ is connected, we say that regions $r$ and $r'$ are adjacent. $r$ and $r'$ are 
overlapping if $\mathrm{int}(r \cap r') \neq \emptyset$. If $r$ and $r'$ are neither adjacent nor overlapping, 
we say that they are disjoint. These definitions are illustrated in Figure 1. In 
the example 1.c, the intersection of the two regions is composed of a single 
point which is on the boundary of the union. Thus the interior of their union is 
composed of two connected components, and we say that the regions are disjoint. 




a) Adjacent regions b) Disjoint regions c) Disjoint regions d) Overlapping regions 
Fig. 1. Relations between regions 



2.2 Regional Covers, Divisors, Multiples, and Pyramids 

We define a regional cover of a region $I$ (e.g. the support of an image) as a set $P_i$ 
of regions $r_j \in P_i$ such that two different regions from $P_i$ are either disjoint or 
adjacent and $I = \bigcup_{P_i} r_j$. A regional cover of $I$ is a “partition” of $I$ into regions 
whose overlapping parts are thin. 

We will now introduce new concepts that can be interesting when comparing 
several regional covers. Let $P_i$ and $P_i'$ be two regional covers of $I$. We say that 
$P_i$ divides $P_i'$ if and only if each region of $P_i'$ has a regional cover in $P_i$ (i.e. 
each region of $P_i'$ is equal to the union of adjacent regions of $P_i$). We note, for 
convenience, $P_i | P_i'$. $P_i$ is called a divisor of $P_i'$, and $P_i'$ is a multiple of $P_i$. A 
divisor of a regional cover can be obtained by splitting its regions whereas a 
multiple can be obtained by merging its regions. 

The least common multiple of $n$ regional covers $P_{i, 1 \le i \le n}$ of a region $I$ is the 
multiple $P$ of $P_{i, 1 \le i \le n}$ such that any regional cover $P_i'$ with $P_i | P_i' | P$ is not a 
multiple of one or more covers $P_{j, j \neq i}$. The greatest common divisor of $n$ regional 
covers $P_{i, 1 \le i \le n}$ of a region $I$ is the divisor $P$ of $P_{i, 1 \le i \le n}$ such that any 
regional cover $P_i'$ of $I$ with $P | P_i' | P_i$ is not a multiple of one or more regional 
covers $P_{j, j \neq i}$. The least common multiple (resp. the greatest common divisor) of 
a set of regional covers can be seen as the regional cover obtained by intersecting 
(resp. merging) two by two the boundaries of the initial regional covers. These 
definitions are illustrated in Figure 2. 
a) Two regional covers b) Greatest Common Divisor c) Least Common Multiple 

Fig. 2. The least common multiple and the greatest common divisor of two regional 
covers. 



Irregular pyramids are well studied data structures in computer vision [5, 
1]. They enable the representation of hierarchies of partitions of images. Our 
definition of a pyramid differs slightly from the existing ones in that we use 
regional covers instead of cellular partitions or graphs. This definition, although 
based on the same structure, leads to a simple and elegant formulation, which 
is expressed in a topological framework rather than in a graph framework. 

We define for our purpose a pyramid $V$ as a set of $n$ regional covers 
$V = \{L_1, ..., L_n\}$ satisfying $L_1 | L_2 | ... | L_n$. The regional covers $L_i$ are the levels of the 
pyramid, $L_1$ is its base level and $L_n$ its top level. An example of a pyramid with 
three levels is depicted in Figure 3. 
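
On digital images the divisibility relation and the pyramid condition can be checked with label maps. The sketch below is a discrete stand-in for the topological definitions, not part of the original method: regions are represented by integer labels (an assumption made here), shared boundaries are ignored, and connectivity of regions is not verified.

```python
import numpy as np


def divides(fine, coarse):
    """True if every region of `fine` lies inside a single region of `coarse`.

    Both arguments are integer label maps of the same shape.
    """
    fine = np.asarray(fine).ravel()
    coarse = np.asarray(coarse).ravel()
    # Distinct (fine label, coarse label) pairs that occur in the image.
    pairs = np.unique(np.stack([fine, coarse], axis=1), axis=0)
    # Each fine label must be paired with exactly one coarse label.
    return len(pairs) == len(np.unique(fine))


def is_pyramid(levels):
    """Check L_1 | L_2 | ... | L_n for a list of label maps."""
    return all(divides(a, b) for a, b in zip(levels, levels[1:]))
```

A list of label maps in which each level is obtained by merging regions of the previous one passes the check; a level that cuts across regions of its predecessor does not.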




Fig. 3. A pyramid of regional covers. 

3 The Redundancy Pyramid 

Segmentation processes are noisy processes which can remove arbitrary regions 
or boundaries. The smallest common multiple of a set of covers obtained by 
segmentation is not stable. However, a more reliable manner to analyze common 
substructures of $m$ “noisy” regional covers is to compute all the smallest common 
multiples of a certain number $i$ of regional covers. The smallest common multiples 
depend on the covers used to compute them. It then makes sense to compute their 
greatest common divisor $L_i$, which can be seen as the union of their boundary 
points. In this section, we will show that the $L_i$ form a pyramid. We will give 
an efficient way to compute this pyramid using digital 2D images. 

3.1 Definition 

The following lemma simply results from the definitions. It enables one to un- 
derstand how the structure of the redundancy pyramid is built. 

Lemma 1. Let $T$ be a set of regional covers $P^1_{i, 1 \le i \le m_1}$. Let $L_1$ be their greatest 
common divisor. Let $P^2_{i, 1 \le i \le C^2_{m_1}}$ be all the possible smallest common multiples 
of two regional covers, and $L_2$ be the greatest common divisor of the covers $P^2_i$. 
Then we have $L_1 | L_2$. 

The idea of the proof is that any intersection or difference between regions taken 
from different regional covers can be obtained by the union of regions from $L_1$. 
Thus regions of $L_2$ are equal to nonempty unions of regions of $L_1$ and $L_1 | L_2$. 
The following theorem is fundamental as it shows that the structure of the 
redundancy pyramid is a pyramid. 

Theorem 1. Let $T$ be a set of regional covers $P^1_{i, 1 \le i \le m_1}$, and let 

— $L_1$ be the greatest common divisor of $P^1_{i, 1 \le i \le m_1}$, 

— $L_i$ with $1 < i$ be the greatest common divisor of all the least common multiples 
of $i$ regional covers of $P^1_{i, 1 \le i \le m_1}$. 

Then the set $V = \{L_1, ..., L_n\}$ is a pyramid. It is called the redundancy pyramid 
of $T$. 

Let us denote by $P^i_j$ all the smallest common multiples of $i$ regional covers taken from 
the original set, and by $L_i$ their greatest common divisor. Let $P^{i+1}_j$ be all possible 
least common multiples of two regional covers taken from the $P^i_j$. We remark that 
$L_{i+1}$ is equal to the greatest common divisor of the $P^{i+1}_j$. Then we can apply 
Lemma 1 in order to prove that $L_i | L_{i+1}$. As it is true for $L_1$ and $L_2$, 
we have $L_1 | L_2 | ... | L_m$. Note that by definition $L_n$ is the least common multiple of 
the $P^1_i$. 

3.2 Construction with Morphological Operators 

The algorithm presented in this section is based on a boundary representation 
of each regional cover of digital images. The idea is that the set of boundary 
points of the level Li of the redundancy pyramid is composed of points which 
are boundary points of i regional covers. Accumulating directly boundary points 
will not lead directly to the construction of the pyramid, as some combinations 
of boundary points can lead to pendant edges or isolated points. A first filter- 
ing is therefore done to remove them. On certain configurations, applying only 
this algorithm is not enough to filter out all undesired edges, but it produces 
satisfactory results in most real world situations. A simple example is depicted 
in Figure 4. This figure shows the initial regional covers (“partitions”) of three 
different projected cubes, similar to the example studied by Keselman et al. [4]. 
The redundancy pyramid can be seen on the fourth figure, where edges have been 
colored according to their redundancy. The dark edges are of higher redundancy 
(i.e. 3), and are the common boundaries of the regions of the last level of the 
redundancy pyramid. The other edges have redundancies of 1, as they appear in 
a single image. 

Although the redundancy pyramid can be built using any kind of partitions, 
the implementation of the preceding algorithm is straightforward when dealing 
with digital 2D images. The initial partitions $P_{i, 1 \le i \le n}$ are described by binary 
images (referred to as contour maps in the following text) indicating the presence 
of contours at each point, i.e. $P_i(x, y) = 1$ if the point of integer coordinates $(x, y)$ 
is a contour point for partition $i$ and $P_i(x, y) = 0$ otherwise. Examples of contour 
maps are drawn in Figure 5. Typically, such images can be obtained by watershed 
transforms [6] or by marking contour points of labeled image partitions. 

a) Cube 1 b) Cube 2 c) Cube 3 d) Redundancy pyramid 
Fig. 4. Redundancy pyramid of images. 

The construction of the boundary redundancy pyramid is based on an accu- 
mulation process of the contour maps. The main steps of the pyramid construc- 
tion are: 

— Each contour point of each contour map is accumulated in an image R of 
natural numbers. 

— A hierarchical watershed of R is computed. We use the leveling transform 
of [6]. The advantage of this watershed algorithm is that one obtains a well 
nested crest network and thus the pyramid in a digital form without using 
extra operations. 

The result of this algorithm is an integer image describing the hierarchical 
watershed. By applying a threshold $i$ to this image, we obtain the contours 
describing the $i$-th level of the redundancy pyramid. This algorithm is not only 
simple but also very efficient. 
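
The accumulation and thresholding steps can be sketched as follows. Note the simplification: this sketch thresholds the accumulated image $R$ directly and omits the hierarchical watershed (leveling) step of [6], so pendant edges and isolated points are not filtered out as they would be in the described algorithm. The function names are hypothetical.

```python
import numpy as np


def accumulate_contours(contour_maps):
    """Sum binary contour maps P_i(x, y) into an integer image R."""
    stack = np.stack([np.asarray(m, dtype=np.int64) for m in contour_maps])
    return stack.sum(axis=0)


def level_contours(accumulated, i):
    """Candidate contour points of level i: points lying on the contours
    of at least i of the input maps (no watershed filtering applied)."""
    return accumulated >= i
```

By construction the candidate contours are nested: every point kept at threshold i + 1 is also kept at threshold i, which mirrors the nesting of the pyramid levels.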



4 Application to Motion Analysis and Background 
Segmentation on an Image Sequence 

The initial data of this application is an image sequence obtained by a static 
camera. The captured scene can be subject to drastic illumination changes, and 
moving objects can occlude some parts of the static scene. A good background 
segmentation cannot be obtained from a single image. The main idea here is to 
construct initial segmentations of a certain number of images in the sequence, 
and to compute the redundancy pyramid of these segmentations. The low level 
of the pyramid will give information on the moving object, while the higher level 
of the pyramid will tend to segment the static scene using information merged 
from the sequence. 

The experiment was done on a sequence where illumination varied in such a way 
that certain images are saturated, while others are dark. In the sequence, a 
person is moving in front of a static background. The initial regional covers were 
obtained by computing the watershed of [6] on the modulus of the Deriche 
gradient of the initial images, and by keeping the points not corresponding to 
basins. As predicted, certain regions corresponding to the sought background 
segmentation could not be retrieved correctly on all the images. They were either 
split or merged. Some images from the sequence and their initial segmentations 
are presented in Figure 5. 




Fig. 5. Images from the sequence and their partitions (Image sequence provided by 
Advanced Computer Vision (ACV), Vienna). 



The redundancy pyramid of the computed regional covers was computed. It is 
shown in Figure 6. Each image was treated in less than 2s on a laptop computer 
with an AMD Athlon processor at 1.8GHz. The program used was not subject 
to any optimization and can easily be implemented on dedicated hardware in 
real time. 




a) Redundancy pyramid b) Level 12 c) Level 25 d) Level 40 

Fig. 6. Redundancy pyramid of the image sequence 



The best segmentation was obtained at an intermediate level of the pyramid. 
This can be explained by the fact that the contours of the background are not 
detected correctly on all the images. The lower levels are very noisy, which is 
due to the over-segmentation of the initial images. However, the trajectory of 
the movement can clearly be seen. The quality of the segmentations obtained 
at intermediate levels is outstanding, considering the initial over-segmentations 
used. Note that no parameter was employed for producing the segmentations 
and the pyramid. The only parameter of this method is Deriche’s $\alpha$, which 
was equal to 1.5. In less extreme conditions, the direct application of the 
previous method should result in stable higher levels of the pyramid. A single 
calibration step expressed in a number of frames would then be required in order 
to obtain a segmentation of a static scene of the same quality as image c of Figure 6. 

5 Conclusion 

We have proposed a new structure, the redundancy pyramid, expressed in a
topological framework. We proposed an efficient algorithm to compute this
structure on 2D digital images of partitions. It can be used in a wide range
of applications, from segmentation fusion to generic object recognition,
motion analysis and background subtraction over a sequence of images under
drastically varying illumination. Some results for the last application were
presented. They validated the approach in a very difficult case. Future work
includes a statistical evaluation of the approach and the generalization of the
algorithm to higher dimensions, to continuous images, and to images that cannot be
directly superimposed on one another.

Abstract. We consider the task of stereo-reconstruction under the following fairly
broad assumptions. A single and continuously shaped object is captured by two
uncalibrated cameras. It is assumed that almost all surface points are binocularly
visible. We propose a statistical model which represents the surface as a triangular
(hexagonal) mesh of pairs of corresponding points. We introduce an iterative
scheme which simultaneously finds an optimal mesh (with respect to a certain
Bayes task) and a corresponding optimal fundamental matrix (in a maximum
likelihood sense). Thus the surface is reconstructed up to a projective transform.



1 Introduction 

Even though stereo-reconstruction is a thoroughly investigated problem of image
processing, which has attracted attention for decades, we should admit that at least a
handful of crucial open problems remain on the agenda. These are mainly modeling problems,
e.g. modeling stereo reconstruction for complex scenes (many objects, occlusions and
self-occlusions, depth discontinuities etc.) [1, 4, 5, 6]. On the other hand, it is noteworthy
that even under much simpler conditions there are still some open questions.

The aim of our paper is to show how at least two of them can be solved under not too
restrictive assumptions. The first one deals with the interplay of surface reconstruction
and camera calibration. Usually, most approaches for surface reconstruction require
either rectified images or, equivalently, calibrated cameras. On the other hand, there are
many methods to estimate the epipolar geometry, given corresponding image points.
That is, corresponding image points should meet a certain epipolar geometry and, on the
other hand, corresponding image points are needed to determine the latter.

The second problem arises in most approaches which try to estimate dense disparity
(or depth) fields: in order to calculate local qualities for possible matches, they usually
utilize fixed-size windows. This becomes inaccurate if the surface is not ortho-frontal. To
improve this, it is necessary to know the local projective transformation (let’s say from
the left to the right image). But, again, this transformation is determined by the unknown
surface.

To overcome these problems we propose a biologically inspired model, which consists of
the following. A surface is described by a field of abstract (binocular) units. The state
(label) of each unit is a pair of corresponding image points. These units are arranged in
an abstract regular hexagonal lattice. This allows us to incorporate a-priori assumptions
about the expected surfaces, like binocular visibility or continuity/smoothness, by either
hard or statistical restrictions. For instance, to avoid reconstructions with self-occlusions,
we require coherent orientations for the states of elementary triangles of the lattice: the
states of the vertices of a triangle define triangles in the left and right image, which should
be coherently oriented. Image similarity as well as consistency with epipolar geometry
are modeled in a statistical way. For the first one we calculate image similarities for the
corresponding image triangles, which allows us to account (automatically) for local
projective transforms. Given an epipolar geometry, the consistency of a pair of corresponding
image points is measured in terms of their distances from corresponding epipolar lines.
Consequently, we obtain a statistical model for the state field of our units and the images,
parametrized by the unknown epipolar geometry (denoted by the fundamental matrix). This
allows us to pose the surface reconstruction as a Bayes task and to estimate the epipolar
geometry in a maximum-likelihood sense. Some of the internal parameters of the model
are automatically estimated in a similar way.

2 The Model 

Let $V$ be a finite set of abstract vertices arranged in a planar and regular hexagonal
lattice. The edges of this lattice are denoted by $e \in E$ and the elementary triangles are
denoted by $t \in T$. Each vertex $v \in V$ has a four-dimensional integer-valued state vector
$x(v) \in \mathbb{Z}^4$, which represents a pair of image positions $x(v) = (x_L(v), x_R(v))$ in the
left and right image. A complete state field (or labeling) is a mapping $x : V \to \mathbb{Z}^4$ and
defines corresponding triangular image meshes. We denote the images by $I_L$ and $I_R$
respectively, e.g. $I_L(x_L)$ being the intensity or color value of the pixel $x_L$ of the left
image (see Fig. 1). The epipolar geometry is represented in terms of a fundamental
matrix $F$.

Assuming surface continuity and binocular visibility, we consider the following
statistical model

$$p(x, I_L, I_R; F) = p(x; F) \cdot p(I_L, I_R \mid x) = \frac{1}{Z} \exp\bigl[-E_a(x) - E_g(x, F) - E_d(x, I_L, I_R)\bigr], \qquad (1)$$

where as usual $Z$ denotes an (unknown) normalizing constant. The terms in the exponent
- usually called energy terms - are described below.

Let us begin by explaining the a-priori energy $E_a(x)$. It expresses our a-priori
assumptions and is locally additive in the triangles and edges of the hexagonal lattice:

$$E_a(x) = \sum_{t \in T} \chi(x(t)) + \sum_{e \in E} \delta(x(e)).$$

The first sum is over all elementary triangles and the function $\chi$ is zero if the abstract
triangle $t$ and both image triangles $x_L(t)$ and $x_R(t)$ are all coherently oriented. It is
infinity otherwise - zeroing the probability of a state field in such a case. The second sum
is over all edges of the abstract lattice and the function $\delta$ is zero if the disparities in the
vertices connected by the edge $e$ differ by not more than a predefined value. It is infinity
otherwise - again zeroing the probability of a state field in such a case. Hence, the second
sum can be seen as a kind of continuity term.
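
The two hard constraints can be illustrated by a minimal sketch. It assumes that a state is a tuple (xl, yl, xr, yr), that orientation is tested through the sign of the triangle area, and that the disparity is the vector difference of the left and right positions compared in the maximum norm; the names `chi`, `delta`, `lattice_sign` and `max_diff` are illustrative and not taken from the paper.

```python
import math

def signed_area(p0, p1, p2):
    """Twice the signed area of triangle (p0, p1, p2); the sign encodes orientation."""
    return (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p2[0] - p0[0]) * (p1[1] - p0[1])

def chi(tri_states, lattice_sign=1):
    """Orientation term: 0 if both image triangles share the lattice orientation, else inf.

    tri_states holds three states (xl, yl, xr, yr) in the vertex order of the lattice.
    """
    left = signed_area(*[(s[0], s[1]) for s in tri_states])
    right = signed_area(*[(s[2], s[3]) for s in tri_states])
    ok = left * lattice_sign > 0 and right * lattice_sign > 0
    return 0.0 if ok else math.inf

def delta(state_a, state_b, max_diff=1):
    """Continuity term: 0 if the disparities at both ends of an edge differ by <= max_diff."""
    disp_a = (state_a[0] - state_a[2], state_a[1] - state_a[3])
    disp_b = (state_b[0] - state_b[2], state_b[1] - state_b[3])
    diff = max(abs(disp_a[0] - disp_b[0]), abs(disp_a[1] - disp_b[1]))
    return 0.0 if diff <= max_diff else math.inf

def prior_energy(states, triangles, edges, max_diff=1):
    """A-priori energy: sum of the orientation terms and the continuity terms."""
    energy = sum(chi([states[v] for v in t]) for t in triangles)
    energy += sum(delta(states[a], states[b], max_diff) for a, b in edges)
    return energy
```

A state field with a single flipped triangle or a single disparity jump thus receives infinite energy, i.e. zero probability.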

The second energy term $E_g(x, F)$ in (1) penalizes state fields $x$ which strongly
deviate from the epipolar geometry $F$:

$$E_g(x, F) = \frac{1}{\sigma_g} \sum_{v \in V} D(x(v), F).$$

It is a sum over all vertices $v \in V$ of our lattice and the function $D$ is simply the
symmetrized distance between the image positions $x_L(v)$, $x_R(v)$ and the corresponding
epipolar lines:

$$D(x(v), F) = D^2(x_L(v), F \cdot x_R(v)) + D^2(x_L(v) \cdot F, x_R(v)),$$

where $D^2(x, l)$ is the squared distance of a point $x$ from the line $l$.
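
As an illustration, the symmetrized distance can be sketched with NumPy. The sketch assumes homogeneous pixel coordinates and the convention that exact matches satisfy $x_L^T F x_R = 0$; the function names are illustrative.

```python
import numpy as np

def point_line_sq_dist(point, line):
    """Squared distance of a homogeneous point (x, y, 1) from the line ax + by + c = 0."""
    a, b, _ = line
    return float(np.dot(line, point) ** 2 / (a * a + b * b))

def epipolar_distance(state, F):
    """Symmetrized epipolar distance for a state (xl, yl, xr, yr)."""
    xl = np.array([state[0], state[1], 1.0])
    xr = np.array([state[2], state[3], 1.0])
    line_in_left = F @ xr    # epipolar line of x_R in the left image
    line_in_right = xl @ F   # epipolar line of x_L in the right image
    return point_line_sq_dist(xl, line_in_left) + point_line_sq_dist(xr, line_in_right)
```

For a rectified pair the epipolar lines are the image rows, so the distance reduces to twice the squared row difference of the two points.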

The third and last energy term $E_d(x, I_L, I_R)$ in (1) is the data energy

$$E_d(x, I_L, I_R) = \frac{1}{\sigma_d} \sum_{t \in T} q(x(t), I_L, I_R).$$

It is a sum over all elementary triangles of the lattice where the function $q$ is a similarity
measure for corresponding pairs of image triangles. In the simplest case it is the sum
of squared differences of intensities/colors in the image triangles $x_L(t)$ and $x_R(t)$. To
calculate it, the triangles are coherently subsampled and the color values for non-integer
image positions are obtained e.g. by bilinear interpolation. If non-Lambertian reflection
is assumed, a more sophisticated similarity measure can be used instead.
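
The simplest variant of $q$ can be sketched as follows. The sketch assumes grey-level images stored as 2D NumPy arrays and a regular barycentric sampling grid of density `n`; both the sampling scheme and the names are assumptions made for illustration, since the paper does not specify how the triangles are subsampled.

```python
import numpy as np

def bilinear(img, x, y):
    """Bilinearly interpolated grey value at (x, y), with x = column and y = row."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, img.shape[1] - 1), min(y0 + 1, img.shape[0] - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bottom = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bottom

def barycentric_samples(n):
    """Barycentric coordinates of a regular grid of sample points inside a triangle."""
    return [(i / n, j / n, 1 - (i + j) / n) for i in range(n + 1) for j in range(n + 1 - i)]

def triangle_ssd(img_l, img_r, tri_states, n=4):
    """Sum of squared differences over coherently sampled points of both image triangles."""
    pl = np.array([[s[0], s[1]] for s in tri_states], dtype=float)
    pr = np.array([[s[2], s[3]] for s in tri_states], dtype=float)
    total = 0.0
    for w in barycentric_samples(n):
        # The same barycentric weights address corresponding points in both triangles.
        xl, yl = np.dot(w, pl)
        xr, yr = np.dot(w, pr)
        total += (bilinear(img_l, xl, yl) - bilinear(img_r, xr, yr)) ** 2
    return float(total)
```

Because the same barycentric weights are used in both images, the comparison automatically follows the local affine map between the two triangles instead of a fixed window.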

In summary, we obtain a Gibbs probability distribution of order three: the highest
order contributions ($\chi$ and $q$) are defined on (elementary) triangles. According to the
well-known theorem of Hammersley and Clifford, this probability distribution is Markovian
with respect to the hexagonal lattice.

3 Task Formulation and Solution 

Assuming for a moment that the epipolar geometry is known, we formulate the surface
reconstruction problem as a Bayes decision with respect to the following loss function

$$C(x, x') = \sum_{v \in V} \| x(v) - x'(v) \|^2,$$

which is locally additive. Each local addend measures the squared deviation of the
estimated correspondence pair $x(v)$ from the (unknown) true correspondence pair $x'(v)$.
Minimizing the average loss (i.e. the risk) gives the following Bayes decision [8]

$$x^*(v) = \sum_{s \in \mathbb{Z}^4} s \cdot p_v(x(v) = s \mid I_L, I_R; F),$$

where $p_v(x(v) = s \mid I_L, I_R; F)$ denotes the marginal a-posteriori probability for the
correspondence pair $x(v) = s$ in the vertex $v$, given the images and the epipolar geometry:

$$p_v(x(v) = s \mid I_L, I_R; F) = \sum_{x \,:\, x(v) = s} p(x \mid I_L, I_R; F). \qquad (2)$$

Hence, we need these probabilities for the Bayes decision. It is noteworthy that the latter
gives non-integer decisions, though the states are considered as integer-valued vectors.

Let us return to the general case of unknown epipolar geometry. In order to estimate
the fundamental matrix in a maximum likelihood sense, we have to solve the task

$$F^* = \arg\max_F \sum_x p(x, I_L, I_R; F). \qquad (3)$$

Because we do not know how to perform the above sum over all state fields explicitly
(in polynomial time), we propose to use the EM algorithm in order to solve the problem
iteratively. The standard approach gives the following task, which should be solved in
each iteration:

$$F^{\text{new}} = \arg\max_F \sum_x p(x \mid I_L, I_R; F^{\text{old}}) \cdot \ln p(x, I_L, I_R; F).$$

Substituting our model (1) in the $\ln$ and omitting all terms which do not depend on $F$,
we obtain

$$F^{\text{new}} = \arg\max_F \; -\frac{1}{\sigma_g} \sum_x p(x \mid I_L, I_R; F^{\text{old}}) \sum_v D(x_L(v), x_R(v), F).$$

It is important to notice that this step is possible because the unknown normalizing
constant $Z$ in (1) does not depend on $F$. Exchanging the summations we finally obtain
the task



$$F^{\text{new}} = \arg\min_F \sum_v \sum_{s \in \mathbb{Z}^4} D(s, F) \cdot p_v(x(v) = s \mid I_L, I_R; F^{\text{old}}), \qquad (4)$$

where again $p_v(x(v) = s \mid I_L, I_R; F^{\text{old}})$ denotes the marginal a-posteriori probability for
the correspondence pair $x(v) = s$ in the vertex $v$, given the images and the epipolar
geometry $F^{\text{old}}$. For a crisp set of correspondences the fundamental matrix can be estimated
by standard techniques (see e.g. [3]). Such techniques can be easily extended to our
case (4): we consider all possible correspondences, each one weighted by its marginal
a-posteriori probability. It should be remarked that the model parameters $\sigma_g$ and $\sigma_d$
can be learned in a similar way [7].

Fig. 2. An artificial stereo pair and the estimated epipolar geometry

Fig. 3. Views of the reconstructed surface
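
A possible form of such a weighted estimation is sketched below. It is not the estimator of the paper: instead of the geometric distance $D$ it minimizes the weighted algebraic error of the normalized eight-point algorithm, which is a common initialization for geometric refinement. The convention $x_L^T F x_R = 0$ and the name `weighted_fundamental` are assumptions.

```python
import numpy as np

def weighted_fundamental(xl, xr, w):
    """Weighted eight-point estimate of F such that x_L^T F x_R = 0.

    xl, xr: (n, 2) arrays of candidate correspondences.
    w: (n,) non-negative weights, e.g. marginal a-posteriori probabilities.
    """
    def normalise(p):
        # Hartley-style normalisation for numerical conditioning.
        c = p.mean(axis=0)
        s = np.sqrt(2.0) / np.mean(np.linalg.norm(p - c, axis=1))
        T = np.array([[s, 0.0, -s * c[0]], [0.0, s, -s * c[1]], [0.0, 0.0, 1.0]])
        return np.column_stack([p, np.ones(len(p))]) @ T.T, T

    hl, Tl = normalise(np.asarray(xl, dtype=float))
    hr, Tr = normalise(np.asarray(xr, dtype=float))
    # One row per candidate: coefficients of the entries of F in x_L^T F x_R.
    A = np.einsum("ni,nj->nij", hl, hr).reshape(len(hl), 9)
    A = A * np.sqrt(np.asarray(w, dtype=float))[:, None]
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt   # enforce rank 2
    F = Tl.T @ F @ Tr                         # undo the normalisation
    return F / np.linalg.norm(F)
```

Candidates with weight zero do not influence the result, so unlikely correspondences are suppressed smoothly rather than by a hard inlier decision.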

Summarizing, we see that both the Bayes decision for the state field and the maximum
likelihood estimation of the fundamental matrix can be performed, provided that
the marginal a-posteriori probabilities for the states, i.e. positions of corresponding points,
are known. We do not know how to perform the summation in (2) efficiently. Nevertheless,
it is possible to estimate the needed probabilities using a Gibbs sampler [2]. In our
particular case we iteratively choose a vertex $v$, fix the states in all six neighboring vertices
(denoted by $\mathcal{N}(v)$) and randomly generate a new state in $v$ according to its a-posteriori
conditional probability, given fixed states in neighboring vertices

$$p_v(x(v) \mid x(\mathcal{N}(v)), I_L, I_R; F) \propto \exp\Bigl\{-\Bigl[\sum_{t : v \in t} \chi(x(t)) + \sum_{e : v \in e} \delta(x(e)) + \frac{1}{\sigma_g} D(x(v), F) + \frac{1}{\sigma_d} \sum_{t : v \in t} q(x(t), I_L, I_R)\Bigr]\Bigr\}.$$

Fig. 4. A real stereo pair of a church fragment and the estimated epipolar geometry

Fig. 5. Views of the reconstructed surface



According to [2], the relative state frequencies observed during this sampling process
converge to the needed marginal probabilities.
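
The sampling scheme and the resulting Bayes decision can be sketched generically. The sketch assumes a small discrete label set and a user-supplied function `local_energy(v, s, x)` that returns the sum of all energy terms involving vertex `v`; the burn-in phase and all names are illustrative choices and not part of the model above.

```python
import math
import random

def gibbs_marginals(n_vertices, labels, local_energy, sweeps=5000, burn_in=500, seed=0):
    """Estimate the marginals p_v(x(v) = s) by relative state frequencies.

    local_energy(v, s, x) returns the sum of all energy terms that involve
    vertex v when its state is s and all other states are taken from x.
    At least one label per vertex must have finite energy.
    """
    rng = random.Random(seed)
    x = [labels[0]] * n_vertices
    counts = [{s: 0 for s in labels} for _ in range(n_vertices)]
    for sweep in range(burn_in + sweeps):
        for v in range(n_vertices):
            # Conditional distribution of x(v), given the states of its neighbours.
            weights = [math.exp(-local_energy(v, s, x)) for s in labels]
            x[v] = rng.choices(labels, weights=weights)[0]
        if sweep >= burn_in:
            for v in range(n_vertices):
                counts[v][x[v]] += 1
    return [{s: c / sweeps for s, c in cnt.items()} for cnt in counts]

def bayes_decision(marginals):
    """Posterior mean per vertex, the minimiser of the expected squared deviation."""
    return [sum(s * p for s, p in m.items()) for m in marginals]
```

For scalar labels the decision is a weighted mean of the labels and therefore in general non-integer, in line with the remark following (2).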



4 Experiments 

To compare the surface reconstruction and the estimated epipolar geometry with ground
truth, we used a stereo pair generated artificially by a ray-tracer (image size 350 × 350,
disparity range 18 pixels, pan ±6°). The input images with overlaid estimated epipolar
geometry are shown in Fig. 2. To check the correctness of the reconstruction results, we
compared the obtained disparities with ground truth for a set of points. The maximal
disparity deviation was 4.3 pixels, whereas the mean squared difference for all points
was 0.2 pixels. Two views of the reconstructed surface are shown in Fig. 3.

The next example shows results obtained for a real stereo pair of a church fragment 
(Fig. 4). The obtained surface is shown both as a rotated view and as the obtained 
triangular mesh (Fig. 5). To give a better impression, we virtually cut out the deepest 
part of the mesh by an ortho-frontal plane. 



5 Open Questions 

Although our first results obtained with the proposed approach seem to be promising,
there are at least three open questions. First, our approach requires a good initialization
for both the fundamental matrix and the state field. So far we use standard
methods, e.g. the maximum a-posteriori decision in a simplified model for the
mesh.

The second question regards the maximum likelihood estimation of the fundamental
matrix. It may be preferable to consider it as a stochastic variable (instead of an unknown
parameter) and to use sampling for solving a suitably posed Bayes task.

Third, it might be preferable to use a smoothness energy term instead of the hard continuity
constraint. Obviously, it is possible to express the deviation from coplanarity for pairs
of neighboring triangles in terms of their states and the fundamental matrix.

Abstract. The “life” of most neural vision systems splits into a one-time
training phase and an application phase during which knowledge
is no longer acquired. This is both technically inflexible and cognitively 
unsatisfying. Here we propose an appearance based vision system for 
object recognition which can be adapted online, both to acquire visual 
knowledge about new objects and to correct erroneous classification. The 
system works in an office scenario; acquisition of object knowledge is triggered
by hand gestures. The neural classifier offers two ways of training:
Firstly, the new samples can be added immediately to the classifier to 
obtain a running system at once, though at the cost of reduced classi- 
fication performance. Secondly, a parallel processing branch adapts the 
classification system thoroughly to the enlarged image domain and loads 
the new classifier to the running system when ready. 



1 Introduction 

The introduction of neural networks to the field of computer vision has brought
about a change of paradigms: Hard-wired knowledge is no longer used to solve
recognition tasks; instead, domain-specific knowledge is acquired from examples,
in a way both technically easier and cognitively more adequate. However, most
neural recognition systems are still a half-hearted realization of this idea, because
knowledge acquisition ends after an initial training phase. To accomplish online
learning, three basic requirements have to be fulfilled: (i) flexibility of the neural
system to allow the fast incorporation of new knowledge without performing an
entire training cycle; (ii) close to real-time processing within the entire system;
and (iii) a subsystem for human-machine interaction that allows new object
knowledge to be presented in a natural manner of communication.

The system proposed in this paper is part of an office task assistance system.
To fulfill the aforesaid requirements, a neural three-stage system is applied that
combines feature extraction with classification. When trained online, the last
stage, which is the easiest to train, can be quickly adapted to provide a provisional
solution. While the system is running continuously, a new version of the neural
system is trained from scratch in a parallel thread and loaded into the running
system when ready to improve performance.

To cope with the requirements of processing speed and human-machine interaction,
an attentional subsystem allows the fast localization of regions of interest
for object classification, and the simultaneous evaluation of pointing gestures to
establish a common focus of attention (FOA) of human and machine.

While view-based systems often require large training sets [9,8] to cover the 
variety of possible views, the system proposed here needs only a few frames of 
an object to facilitate interactive online training. This is achieved by artificially 
multiplying the available images to obtain new object views. 

While solutions to several of the subtasks outlined above have been investigated,
their integration into larger vision systems is still rare. The Perseus
system is able to reference objects a user is pointing at [3]. Ref. [10] proposes a
communication model between image processing and the adjacent data interpretation
for object recognition. A system for hand tracking and object reference
that also allows the integration of modalities other than vision was proposed
for the Cora robot system [13]. The approach presented here goes beyond these
systems in three aspects: (i) several vision capabilities are integrated in a common
framework, (ii) using the neural approach, a single type of system deals
with two sub-tasks in a unified way, and (iii) online learning is realized in a
human-machine interaction loop.




Fig. 1. Left: Office scenario, the user is pointing at objects. Right: Processing flow, 
starting with an image of the desk (left). In parallel, the scene is scanned (a) for 
pointing gestures (upper branch) using skin color segmentation and one instance of 
the VPL-classifier, and (b) for known objects (lower branch). If an object is pointed 
at, the “online loop” (right ellipse) is started. Once the referenced object location is 
identified, images are acquired. The database is extended by artificially distorted views 
(scale/shear transformation, translatory offset), then the VPL-classifier employed for 
object recognition is retrained (section 3.2). Note the two instances of the VPL are 
independent from each other. 



The experimental setup is part of the VAMPIRE project (Visual Active Memory
Processes and Interactive REtrieval). The work is aimed at the development
of an active memory and retrieval system in the context of an augmented reality
scenario. An important subtask is the recognition of objects and simple actions
in an office environment. Here, a user sits in front of a desk and two cameras observe
the scene. The system continuously classifies objects on the desk and interprets pointing
gestures (Fig. 1 left). If either a completely unknown object is presented or if the
system does not recognize an already trained object correctly, the user leads the
system by gestures to train or retrain the object classifier. We will first describe
the standard data flow of the object recognition (section 2), then in section 3
the online training triggered by hand gestures.

Fig. 2. Intermediate processing results. In the upper branch, the input image is
processed for object classification; in the lower branch, hand gestures are detected. Objects
are located by saliency maps; the example shows an entropy map. Hands can be
detected by skin color. Located objects are virtually rotated to a normalized position to
facilitate recognition. Recognized pointing gestures are “translated” into an attention
map which shows a “beam” of activation in the pointing direction. The attentional
subsystem described in [1] establishes the correspondence between pointing direction
and one of the objects. In the right column, labeled objects and, below, probabilities
for objects being pointed at are shown.

2 Object Recognition System 

Fig. 1 depicts the processing flow of the trained recognition system in the left
ellipse; Fig. 2 shows some intermediate processing results. For object localization,
saliency maps are computed using different mechanisms as described in [1].
Fig. 2 depicts only the “entropy map” as an example, which derives the “conspicuity”
of regions from their information content following the algorithm of [4]. The
method relies on the assumption that semantically meaningful areas also have
a high information content in the sense of information theory. Several saliency
maps are integrated in the attentional subsystem into a joint saliency map.
Parameterization of this module allows the selection of the scale on which structures
are evaluated (here: scale of office objects). Maxima of the joint saliency map
indicate candidate regions for object classification. The representation of object
locations by attention maps facilitates the integration of pointing directions to
establish object reference (Fig. 2, for details see [1]).

Subsequently, the neural-net-based VPL-classifier classifies the candidate regions.
The result is either a class number for an already trained object, or a
reserved class label for “unknown”. The VPL is a neural classification architecture
which is particularly well suited for fast online training and retraining
from a small database. “VPL” stands for three processing stages, which combine
feature extraction from the pixel level with classification: Vector quantization,
PCA and LLM-networks (see [1] and references therein). The VPL-classifier
extracts features from the input by local PCA, which are subsequently classified by
a bank of LLM-networks. An overview of the processing flow is given in Fig. 3.

The VPL is trained as follows: The first level (“V”) uses vector quantization
(VQ) to partition the input space. For VQ, the algorithm proposed in [2]
is employed. In the second level (“P”), for the training data assembled in the
Voronoi tessellation cells of each of the resulting reference vectors, the principal
components (PCs) are computed by the neural algorithm proposed in [12] to
reduce dimensionality. That is, a single-layer feed-forward network is attached to
each reference vector for the successive calculation of the local PCs. In combination,
the first two processing stages perform local PCA, which can be viewed as
a nonlinear extension of simple, global PCA [14].

On the third processing level, one “expert” neural classifier of the Local Linear
Map type (LLM network) is attached to each PCA-net. The LLM network
is related to the self-organizing map [5]; see e.g. [11] for details. After the
unsupervised training of the first and second level, the LLM-nets are trained in a
supervised manner.

The trained VPL-classifier is applied to classify each of the candidate regions
in succession. The input is the raw pixel data of windows of pre-defined size
located at the maxima of the saliency map. For each input vector, the best-match
reference vector is selected. Features are then extracted by projection of the
input onto the local PCs. The overlap with the PCs is the input to the attached
LLM-net, which yields the final classification.
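
The first two stages can be illustrated by a minimal sketch. It deliberately swaps in standard batch techniques, namely plain k-means for the VQ algorithm of [2] and an SVD-based PCA for the neural PCA of [12], and it assumes that inputs are centred on the reference vector before projection. The LLM stage is omitted, and all function names are illustrative.

```python
import numpy as np

def fit_local_pca(samples, n_refs, n_pcs, iters=20, seed=0):
    """Stand-in for the V and P levels: k-means, then an SVD-based PCA per Voronoi cell."""
    rng = np.random.default_rng(seed)
    refs = samples[rng.choice(len(samples), n_refs, replace=False)].astype(float)
    for _ in range(iters):
        cell = np.argmin(((samples[:, None, :] - refs[None]) ** 2).sum(-1), axis=1)
        for k in range(n_refs):
            if np.any(cell == k):
                refs[k] = samples[cell == k].mean(axis=0)
    cell = np.argmin(((samples[:, None, :] - refs[None]) ** 2).sum(-1), axis=1)
    pcs = []
    for k in range(n_refs):
        centred = samples[cell == k] - refs[k]
        # Rows of vt are the principal directions of the cell, strongest first.
        # Cells with fewer than n_pcs samples yield fewer directions.
        vt = np.linalg.svd(centred, full_matrices=False)[2]
        pcs.append(vt[:n_pcs])
    return refs, pcs

def extract_features(x, refs, pcs):
    """Select the best-match reference vector and project x onto its local PCs."""
    k = int(np.argmin(((refs - x) ** 2).sum(axis=1)))
    return k, pcs[k] @ (x - refs[k])
```

The pair returned by `extract_features` corresponds to the index of the expert that would be consulted and to the feature vector passed to it.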

3 Online Learning 

Fig. 3. Left: VPL-classifier. Features are extracted by local PCA and classified
subsequently by neural classifiers. Right: Classification rates and training times for different
size parameters of the VPL-classifier with respect to the number of objects learned
(denoted within the symbols) using the full training mode.

If object classification is erroneous or new objects are to be added to the set,
the user can activate the teaching mode. The teaching mode is realized as a
finite state machine. It is activated by keyboard input; then the new or wrongly
classified object on the desk must be indicated by a pointing gesture. The system
memorizes the position of the object to be learned and starts to acquire shots of
the object. After a sample image of the object has been taken, the system waits
for the user’s hand to reappear and move the object to a different pose. When the
hand is out of sight again, the system takes the next image automatically, and
so on. The user decides when all relevant poses have been captured and finally
ends the procedure by declaring the object either as “new” by giving it a new
label, or as “improved” by using an old label (again using the keyboard).
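
The teaching dialogue can be sketched as a small finite state machine. The state and event names below are illustrative assumptions; the paper only states that such a machine exists, not how its states are organized.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    AWAIT_POINTING = auto()
    AWAIT_HAND_GONE = auto()   # next image is taken once the hand is out of sight
    AWAIT_HAND = auto()        # an image was taken; wait for the hand to reappear

class TeachingMode:
    """Finite state machine for the teaching dialogue (hypothetical event names)."""

    def __init__(self):
        self.state = State.IDLE
        self.position = None
        self.views = []
        self.labelled = None

    def on_key_teach(self):
        if self.state is State.IDLE:
            self.state = State.AWAIT_POINTING

    def on_pointing(self, position):
        if self.state is State.AWAIT_POINTING:
            self.position = position
            self.state = State.AWAIT_HAND_GONE

    def on_frame(self, image, hand_visible):
        if self.state is State.AWAIT_HAND_GONE and not hand_visible:
            self.views.append(image)
            self.state = State.AWAIT_HAND
        elif self.state is State.AWAIT_HAND and hand_visible:
            self.state = State.AWAIT_HAND_GONE

    def on_key_label(self, label):
        if self.state in (State.AWAIT_HAND, State.AWAIT_HAND_GONE) and self.views:
            self.labelled = (label, self.views)
            self.views, self.position = [], None
            self.state = State.IDLE
```

Whether the label is new or already known would be decided by the caller, which then either adds a class or extends the training views of an existing one.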

After image acquisition, the system trains the currently used VPL-classifier 
in fast training mode and resumes classification. In parallel, a new VPL-classifier 
is trained in full training mode. In the following, the components of the online 
training are described. 
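
The interplay of the two training modes can be sketched with a background thread. The callables `fast_retrain` and `full_train` are placeholders for the two training modes, and the class name is illustrative; the sketch only shows the swapping logic and assumes that classifiers are plain callables.

```python
import threading

class OnlineClassifier:
    """Serve a provisional classifier at once and swap in a fully trained one later."""

    def __init__(self, classifier, fast_retrain, full_train):
        self._lock = threading.Lock()
        self._classifier = classifier
        self._fast_retrain = fast_retrain   # (classifier, new_views) -> classifier
        self._full_train = full_train       # (all_views) -> classifier
        self._views = []

    def classify(self, x):
        with self._lock:
            clf = self._classifier
        return clf(x)

    def add_object(self, new_views):
        self._views = self._views + list(new_views)
        snapshot = list(self._views)
        # Fast mode: adapt the running classifier and resume classification.
        provisional = self._fast_retrain(self._classifier, new_views)
        with self._lock:
            self._classifier = provisional
        # Full mode: train from scratch in parallel, then replace the running classifier.
        worker = threading.Thread(target=self._train_and_swap, args=(snapshot,), daemon=True)
        worker.start()
        return worker

    def _train_and_swap(self, views):
        full = self._full_train(views)
        with self._lock:
            self._classifier = full
```

If objects are added faster than full training completes, an older full training run could overwrite a newer provisional classifier; a version counter would be one way to guard against this.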



3.1 Pointing Gesture Recognition 

For pointing gesture recognition, a system proposed in [1] is applied, which is
described only briefly here. It is based on an adaptive skin color segmentation
motivated by [7]. If a skin-colored blob is found, the corresponding image region is
classified by another instance of the VPL-classifier (“VPL-HAND”), which is not
connected to the module employed for object recognition. VPL-HAND yields two
pieces of information: (a) whether the skin-colored blob is a pointing hand at all,
and, if so, (b) the pointing direction. The attention module then establishes the
correspondence of the pointing direction and the referenced object, as described
in [1].

3.2 Retraining the Classifier 

The full training mode of the VPL-classifier comprises the three steps described
in section 2: VQ, local PCA, and training of the LLM-networks. The novel,
labeled object views are added to the existing set of training views, then a
VPL-classifier is trained from scratch. Training time depends approximately linearly
on the number of objects; the most time-consuming step is the training of the
PCA-nets. Therefore, the fast training mode leaves the V- and P-level unchanged
and retrains only the LLM-nets. The method relies on the assumption that the
existing feature extraction by local PCA is also able to capture the novel object.
So, for the newly captured views only feature vectors are computed and added
to the existing set of feature vectors (which must be memorized). Each of the
LLM-nets is then trained anew for the best-match samples of its reference vector.
Naturally, recognition rates are not as high as for the fully trained classifier.

To minimize the effort the user has to spend on teaching the system new 
objects, the set of online-captured training views is artificially expanded by two 
methods: 

Scale/shear expansion: The appearance of objects within the workspace dif- 
fers in scale and shear due to varying camera distance. Since the 3D-position 
of a newly acquired object is known, scaling and shearing transformations can 
be used to generate additional artificial views which cover the range of camera 
distances. 

Translatory offset: The attentional subsystem is not always able to locate the
reference frame (from which features are extracted) exactly at the object
center. Therefore, object views with minor translatory offsets are added to the
training set to improve classification robustness.
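
Both expansion methods can be sketched with a single resampling routine. The sketch assumes nearest-neighbour resampling and arbitrary example ranges for scale, shear and offset; in the system described above the scale and shear ranges would instead follow from the known 3D position of the object and the camera geometry.

```python
import numpy as np

def affine_view(img, scale=1.0, shear=0.0, offset=(0, 0)):
    """Resample img under a scale/shear about its centre plus a pixel offset (dx, dy)."""
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # Inverse mapping: for every output pixel, find the source pixel.
    A_inv = np.linalg.inv(np.array([[scale, shear], [0.0, scale]]))
    dx, dy = xs - cx - offset[0], ys - cy - offset[1]
    src_x = A_inv[0, 0] * dx + A_inv[0, 1] * dy + cx
    src_y = A_inv[1, 0] * dx + A_inv[1, 1] * dy + cy
    # Nearest-neighbour lookup; pixels outside the image repeat the border.
    src_x = np.clip(np.rint(src_x), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(src_y), 0, h - 1).astype(int)
    return img[src_y, src_x]

def expand_views(views, scales=(0.9, 1.0, 1.1), shears=(-0.1, 0.0, 0.1),
                 offsets=((0, 0), (2, 0), (-2, 0), (0, 2), (0, -2))):
    """Generate artificial views for every combination of the given distortions."""
    return [affine_view(v, s, sh, o)
            for v in views for s in scales for sh in shears for o in offsets]
```

With the example ranges, each captured view yields 45 training views, so a handful of captured poses already produces a training set of a few hundred samples.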

4 Evaluation 

The VPL-classifier was tested for two aspects of online learning: 

— How do different size parameters for the VPL-classifier affect training times 
and classification rates with respect to the number of objects? 

— How do classification rates and training times differ for the full training mode 
and the fast training mode?

For systematic evaluation, for each of 12 typical office objects (e.g. stapler, sharpener,
highlighter) a set of 60 images was recorded in the following way: On the
desk, six fixed positions were marked and the objects were placed at each of
these positions. At each position, 10 arbitrary poses of each object were recorded. The
resulting set of images contains 720 samples of size 61 × 61 pixels. The 120 images
recorded at a fixed reference position were used for training, the remaining 600
for testing.

The size parameters of the VPL are the number of reference vectors $N_V$, the
number of local principal components $N_P$, and the number of LLM-nodes $N_L$.
So, VPL-size is given in the form $N_V$-$N_P$-$N_L$. Fig. 3, right, shows the results
using the size parameters 3-3-20, 5-5-20 and 7-7-20 for the full training mode.
The 3-3-20 classifier can be trained quite fast, but the recognition rate drops
significantly as the number of objects to be learned is increased. The 5-5-20
and the 7-7-20 classifiers have better capabilities to learn more objects at high
recognition rates, but the computational time needed for training increases.

Fig. 4 (left) shows recognition rates for classifiers that were first trained in full
training mode to recognize 2, 4, ... 10 objects and subsequently extended to
recognize additional objects using the fast training mode. As expected, in all cases
the recognition rate for a fixed total number of objects is better if all of them
were trained in full training mode, as compared to some being learned in fast
training mode. The recognition rate drops when more objects are trained in fast
training mode. However, this decay is much smaller if the classifier initially holds
more objects (acquired in full training mode), because in this case the feature
extraction covers a larger variety of appearances. While for two initial objects
(recognized at 100%) performance drops dramatically, for eight objects the still
good initial performance of 90% only reduces to 82%. In all cases, however, the
drop in recognition rate after adding just one new object is tolerable.

Fig. 4 (right) shows the average training times comparing fast and fully trained
classifiers with respect to different numbers of objects. Training times for fast
training mode clearly remain below those of the full training mode.





Fig. 4. Left: Classification rates for the fast training mode. Each curve visualizes the 
drop of the classification rate starting with classifiers that were fully trained on 2, 
4, 6, 8 and 10 objects, respectively, and extended using the fast training mode. Size 
parameters were 3-8-20. Right: Training times for the fast training mode compared to 
full training mode. The main curve represents the training times for the full training 
mode and the forked curves represent training times for the fast training mode starting 
off with classifiers trained on 2, 4, 6, 8 and 10 objects, respectively. 



5 Conclusion 

We have presented a computer vision system for interactive online object learning
guided by pointing gestures. The learning mechanism allows both the acquisition of
new object knowledge and the improvement of classifications of already known objects.
The system relies on a neural classifier, which builds a view-based object
representation. The system can be adapted fast to obtain a provisional version, while
a full training is performed in the background. The performance of the provisional,
fast training improves the more objects the system already knows, a
property which is plausible also from a cognitive point of view.

An important goal of future research is a “self-diagnosis” of the system, which 
can give confidence values for the systems ability to classify objects correctly. 
By this means, an estimate for the necessary object views could be given dur- 
ing online training, depending on the object’s complexity. Moreover, the system 
could ask for more training views of objects where classification appears
unreliable. The self-diagnosis should also be able to judge whether the existing
feature extraction is sufficient for a newly acquired object. Thus, it would be
possible to decide in which cases fast training mode makes sense. Another goal is
to accelerate the offline feature adaptation; a promising approach is proposed,
for example, in [6].
Abstract. Viewer-centered estimation of the pose of a three-dimensional
object has two main advantages: No explicit models are needed
and error-prone corner detection is not necessary. Eigenspace methods 
have been successful in pose estimation especially for faces. However, 
most eigenspace-based algorithms fail if the images are corrupted, e.g. if
the object is occluded, the background differs from the training images 
or the image is geometrically transformed. EigenTracking by Black and 
Jepson uses robust estimation to find the correct pose. We show that 
performance degrades for objects whose silhouette changes greatly with 
3D rotation. To solve this problem we introduce masks that adapt to 
the estimated object pose. To this end we used hierarchical eigenspaces 
containing both the appearance and mask descriptions. We illustrate the 
improvement in pose estimation precision for some typical objects. 



1 Introduction 

The pose of a known object is needed in many applications, most notably when 
the object is to be manipulated by a robot. In a real world situation the object 
will be part of a complex scene, i.e. the background is cluttered and the object 
may be partially occluded. Further, the location of the object in the scene is 
only vaguely known and the camera could be rotated by a small angle. 

Viewer-centered approaches are often preferred to object-centered approaches 
due to their robustness and straightforwardness. In particular, algorithms using the
eigenspace method (also PCA or Karhunen-Loève transform) are popular since
they have low complexity. However, PCA is highly sensitive to structured noise
which can originate from occlusion, different background or geometric transfor- 
mation of the object. 

Several researchers use template matching in eigenspace to solve the problem
of translation. Pentland et al. use modular eigenspaces of nose, mouth and eyes
for face recognition [1]. While making their approach robust to local distortions
of the image, such predefined regions of interest are not available for general 
objects. Yoshimura and Kanade [2] efficiently find the rotation of a 2D template 
in the image plane through multi-resolutional eigenimages. This approach is 
not tolerant to occlusions and is not easily generalized to 3D objects. Chang 
for sigma = sigma_max .. sigma_min
    for layer = highest .. lowest
        estimate coefficients c
        estimate similarity transform a
        project a and c to next layer
    endfor
endfor

Fig. 1. Pseudocode of the principle of EigenTracking



et al. [3] use the energy of each region in eigenspace and its second derivative 
to find the best match. Rotated or scaled versions of the trained objects are 
not found by this approach. Chang et al. also introduce a quadtree-like partition 
technique to achieve robustness against occlusion. There must be some partitions 
without occlusions for this method to work. Ohba and Ikeuchi [4] also partition
the image into smaller regions centered around corners. Thus, shift invariance 
is given but rotation and scale of the object are not detected. Leonardis and 
Bischof [5] extract several models based on sets of pixels to achieve robustness 
against partial occlusion. The number of models needed depends on the amount 
and structure of the occlusion as well as on the density of pose-relevant pixels 
and can be very high. In a recent work they adapt their algorithm to scaled and 
translated objects [6]. Rotations are not detected. 

Black and Jepson [7] introduce an algorithm for tracking 3D objects by their 
appearance. It simultaneously finds an optimal match for the appearance of the 
object with an eigenspace approach and estimates an affine transform for the 
image. While their EigenTracking algorithm works well in several situations,
we show its limitations for pose estimation of objects whose silhouette varies
with 3D rotation in a complex scene. To overcome this problem we present
three strategies for integrating masks into EigenTracking. They are based on a
hierarchical eigenspace approach.

The remainder of this paper is organized as follows: The second section gives
a brief overview of the EigenTracking algorithm. Extensions of this algorithm are
introduced in section three. Section four gives illustrative examples of the
improvement through masks in EigenTracking. The fifth section contains concluding
remarks.



2 Using EigenTracking for Pose Estimation 

EigenTracking is based on a generalization of the optical flow method. It
alternates between the optimization of the eigenspace coefficients and of the
geometric (here similarity) transform. To achieve robustness against outliers, an
error function which gives less emphasis to large values than the square error is
chosen. The amount of error tolerated is controlled by a parameter $\sigma$ which
decreases iteratively. Because optical flow is restricted to small deviations, a
coarse-to-fine scheme is used as well. Figure 1 shows pseudocode for the algorithm.
We assume the reader is familiar with the concept of coarse-to-fine optimization.
Thus, we will only briefly review the estimation steps from [7].

2.1 Robust Estimation of Coefficients and Affine Transform 

Instead of the square error usually minimized by PCA, Black and Jepson [7] use
a robust error function $\rho(x, \sigma)$:

\[ E(c) = \sum_{x} \rho\bigl(r(x) - [Uc](x), \sigma\bigr) \tag{1} \]

with the test image $r$, its reconstruction $Uc$ and image index $x$. The column
vectors of $U$ are the eigenimages and $c$ is a coefficient vector. The error function
$\rho(x, \sigma)$ and its derivative $\psi(x, \sigma)$ are defined as

\[ \rho(x, \sigma) = \frac{x^2}{\sigma^2 + x^2}, \qquad \psi(x, \sigma) = \frac{2 x \sigma^2}{(\sigma^2 + x^2)^2}. \tag{2} \]

To compensate for geometric transformations of the test image with respect to
the training images, the test image is warped by a similarity transform $s(x, a)$,
where $a$ is a four-dimensional parameter vector. Further, image regions can be
excluded from the optimization by a mask $m$. Including this information in (1),
we obtain the error function

\[ E(c, a) = \sum_{x} m(x) \, \rho\bigl(r(x + s(x, a)) - [Uc](x), \sigma\bigr) \tag{3} \]

which depends on the eigenspace coefficients $c$ and the similarity transform $a$.
A Gauss-Newton optimization is used to minimize the error, keeping one of the
two parameter vectors constant while optimizing the other. The mask $m$ can
contain the outliers found in a previous run of the algorithm or can depend on
another source of information. We will use the latter in the following section.
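
The robust error function (2) and the masked error (3) can be illustrated with a short NumPy sketch. It assumes that the test image has already been warped by the similarity transform and flattened to a vector; the function names and the argument layout are illustrative choices and are not taken from [7].

```python
import numpy as np

def rho(x, sigma):
    """Robust error function (2): x^2 / (sigma^2 + x^2)."""
    x = np.asarray(x, dtype=float)
    return x**2 / (sigma**2 + x**2)

def psi(x, sigma):
    """Derivative of rho with respect to x: 2 x sigma^2 / (sigma^2 + x^2)^2."""
    x = np.asarray(x, dtype=float)
    return 2.0 * x * sigma**2 / (sigma**2 + x**2) ** 2

def masked_error(warped, U, c, mask, sigma):
    """Error (3) for an already warped, flattened test image.

    warped : test image after the similarity warp, shape (n_pixels,)
    U      : eigenimages as columns, shape (n_pixels, n_components)
    c      : eigenspace coefficients, shape (n_components,)
    mask   : per-pixel weights m(x); 1 keeps a pixel, 0 excludes it
    """
    residual = warped - U @ c
    return float(np.sum(mask * rho(residual, sigma)))
```

Since $\rho$ is bounded by 1, a single grossly wrong pixel contributes at most 1 to the error, whereas its contribution to the square error grows without bound.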

3 Masks in EigenTracking 

The training images used for the pose estimation each contain a single view of
the object in front of a uniform background (cf. Fig. 2). Pixels that belong to the
background in all views have zero variance, so the corresponding element
of every eigenimage is zero. Thus, areas which belong to the background in every
training image are disregarded automatically.

EigenTracking has been shown to work well for rotationally symmetric objects like
object062. Black and Jepson also showed experiments for hand form tracking
in front of a uniform background. We found, however, that the pose of objects
with varying silhouette, such as object001 from the COIL database, cannot be
robustly determined by EigenTracking in complex scenes.

The authors suggested using masks to overcome these limitations but have 
not introduced an algorithm for this. In the following we employ hierarchical 
eigenspaces to estimate a mask for the observed view. 
Fig. 2. Two objects taken from the COIL-100 database [8], shown at rotation angles of
0°, 45°, 90°, 135° and 180°. While the shape of object001 changes greatly with 3D
rotation, the shape of object062 is close to constant.



3.1 Hierarchical Eigenspaces 

The concept of hierarchical eigenspaces is well known from Active Appearance
Models [9]. First we build two separate eigenspaces: The image eigenspace with
transform $U_I$ as before and a second eigenspace for masks with transform $U_M$.
The coefficients $c_I$ and $c_M$ of each view are concatenated to a combined data
vector $r_C = [c_I \; c_M]$ (weighting the mask coefficients has not improved results).
The corresponding transform is $r_C = U_C c_C$ using a reduced number of dimensions.
Due to the linearity of the transform, $U_C$ can be split into two parts $U_{C,I}$ and
$U_{C,M}$ for images and masks. We can thus reconstruct image and mask with

\[ \hat{r}_I = U_I U_{C,I} c_C , \qquad \hat{r}_M = U_M U_{C,M} c_C . \tag{4} \]

To retrieve a mask for a test image we first calculate $c_I$ with the robust PCA
estimation. Since $c_M$ is unknown, $r_C = [c_I \; 0]$ is used to find $c_C$ via the robust
PCA estimation, marking the $c_M$ as outliers. The mask is reconstructed with (4).

The reconstructed mask is thresholded to obtain a Boolean mask which
marks all pixels that are not likely to belong to the object as outliers. The 
threshold is decreased iteratively during optimization. For the initial guess of 
the coefficients, the mean mask is used. Figure 3 shows an example. 
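
The construction of the hierarchical eigenspace and the mask reconstruction can be sketched in NumPy as follows. This is a minimal illustration under two assumptions that differ from the description above: the mean is subtracted before each PCA and added back on reconstruction, and an ordinary least-squares fit on the image part of $U_C$ stands in for the robust PCA estimation (ignoring the mask rows corresponds to marking $c_M$ as outliers). All function and key names are hypothetical.

```python
import numpy as np

def pca_basis(data, n_components):
    """Mean and orthonormal PCA basis (as columns) of the row vectors in data."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components].T

def build_hierarchy(images, masks, n_img, n_mask, n_comb):
    """Build image, mask and combined eigenspaces from (n_views, n_pixels) arrays."""
    mean_i, U_i = pca_basis(images, n_img)
    mean_m, U_m = pca_basis(masks, n_mask)
    c_i = (images - mean_i) @ U_i        # image coefficients c_I per view
    c_m = (masks - mean_m) @ U_m         # mask coefficients c_M per view
    r_c = np.hstack([c_i, c_m])          # combined data vectors r_C = [c_I c_M]
    mean_c, U_c = pca_basis(r_c, n_comb)
    return {"mean_i": mean_i, "U_i": U_i, "mean_m": mean_m, "U_m": U_m,
            "mean_c": mean_c, "U_ci": U_c[:n_img], "U_cm": U_c[n_img:],
            "n_img": n_img}

def reconstruct_mask(model, image):
    """Estimate the continuous-valued mask of a view from its image alone."""
    n_img = model["n_img"]
    c_i = (image - model["mean_i"]) @ model["U_i"]
    # c_M is unknown, so only the image rows of U_C are used to fit c_C.
    # Assumption: ordinary least squares stands in for the robust PCA fit.
    rhs = c_i - model["mean_c"][:n_img]
    c_c, *_ = np.linalg.lstsq(model["U_ci"], rhs, rcond=None)
    c_m = model["mean_c"][n_img:] + model["U_cm"] @ c_c
    return model["mean_m"] + model["U_m"] @ c_m   # mask part of equation (4)
```

A Boolean mask is then obtained by thresholding, for example `reconstruct_mask(model, image) > 0.5`, where the threshold value is again only a placeholder.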

3.2 Integration into EigenTracking 

There are several possibilities to integrate the mask approach into the
EigenTracking algorithm (cf. Fig. 1):

— Static Mask. In the most straightforward integration, the mask is reconstructed
before the optimization of the eigenspace coefficients.

— Concurrent Masking. The mask is updated for every step of the robust PCA 
coefficient estimation. 

— Combined Optimization. Instead of optimizing the coefficients of the image 
eigenspace the coefficients of the combined eigenspace are optimized by ro- 
bust PCA. 
Fig. 3. First row: view of object001, mean of all masks of that object, reconstruction
of mask for view. Second row: training mask of view, thresholded mean mask and
thresholded reconstruction.




Fig. 4. Typical test images of the two objects under consideration. Left: complex
background and occlusion; right: occlusion, background change and additionally
a similarity transform applied.



4 Results 



Due to the limited space available we show illustrative results for the two objects 
seen in Fig. 2. Performance for other objects is similar, corresponding to the 
amount of variation in the silhouette. We examine two scenarios: For the first 
experiment the background of the images is exchanged with an irregular pattern 
and the objects are occluded with another COIL object which covers approx. 
l/& th of the image at a random position. The second scenario involves geometric 
transformation of the images as well. Translations of ±10 pixels, rotation of ±10° 
and scale from 0.9 to 1.1 were randomly applied. Figure 4 shows some examples. 

As expected, the robust estimation of the eigenspace coefficients used in
EigenTracking works well for objects with constant silhouette and fails when
the shape of the object varies greatly (e.g. object001). The first row of Fig. 5
shows results for test images with occlusion and background variation. The im- 
provement achieved with the hierarchical approach can be seen in the second 
row. Both mask estimation methods show similar results with Combined Opti- 
mization being slightly better. We show only those results in the following. 
Fig. 5. Absolute error in degrees for pose estimation. First row: performance of original
robust PCA. Second row: results for two novel methods.
Fig. 6. Accuracy of the pose estimation without geometric transform for the robust
estimation technique by Black and Jepson and for two enhancements using masks.
Panels: EigenTracking, Static Mask, Combined Optimization.
The second scenario was used to examine the original EigenTracking algorithm
and two approaches with masks: Static Mask and Combined Optimization.
Figure 6 shows results for object001 in the first row and for object062 in the
second. The higher error as compared to scenario one can be explained through 
the similarity of neighboring views under geometric transformation. 

The execution times of the algorithms vary depending on the number of 
iterations needed for convergence. All experiments were made on an Intel P4 
2.4 GHz machine. The original EigenTracking algorithm takes 3.1 to 4.4 s, the
Static Mask version 4.5 to 5.2 s, Concurrent Masking 4.7 to 6.1 s and the
Combined Optimization approach approx. 10 s.



5 Concluding Remarks 

We have shown that the EigenTracking algorithm by Black and Jepson is useful 
for pose estimation of 3D objects in cluttered scenes and under geometric trans- 
formation. However, the approach works well in complex scenes only for objects 
whose silhouette does not vary much depending on the view. We introduced 
three different methods to include masks for the object shape into EigenTrack- 
ing. They improve the accuracy of the pose estimation significantly, especially 
when no geometric transform of the object is present. Considering the algorith- 
mic complexity, concurrent estimation of an appropriate mask during the robust 
estimation of eigenspace coefficients is recommended. In the future we aim to
replace the geometric transform estimation with an approach that is more robust
than optical flow.
Abstract. We propose a new approach for the reconstruction of 3D
CAD models from paper drawings. Our method uses a combination of 
the well-known fleshing-out-projections method and accumulation tech- 
niques from image processing to reconstruct part models. It should pro- 
vide a comfortable method to handle inaccuracies and missing elements 
unavoidable in scanned paper drawings while giving the user the chance 
to observe and interactively control the reconstruction process. 



1 Introduction 

In the past a lot of research and development work was done to find solutions for
the challenging task of reconstructing 3D CAD models from 2D drawings given 
either on paper or in a machine-readable format like DXF. 

Two main techniques can be found in the literature: The first is the fleshing- 
out-projections principle published by Wesley and Markovsky [6] which delivers 
surface models. The second are algorithms which focus on the extraction of 
manufacturing features¹ directly from the 2D input data (e.g. [1,3,5]).

Both require high-quality complete vector data. Therefore vectorization tech- 
niques are used to convert bitmap data into vector data, but this is a very com- 
plicated process which requires a great amount of user interaction and parameter 
adjustment to get satisfying results. Nevertheless it is practically impossible to 
extract exact data from paper drawings since the drawings themselves are im- 
perfect. The complete description of a part can only be obtained by interpreting 
the geometric sketches together with the textual information like measures or 
symbols. 

Another group of methods which are of interest in this context perform the
extraction of manufacturing features directly from 3D boundary-representation 
models (for example from IGES or STEP files) [2,4]. 

¹ The term “feature” is well-known in the context of image processing and pattern
recognition. However, it is also used in computer-aided design and manufacturing to 
denote basic construction or manufacturing elements like solids made by sweeping a 
profile along a line or around an axis or shapes resulting from milling or drilling. 
Fig. 1. A sample part with three projections and its pseudo-wireframe.



Our solution tries to overcome the difficulties associated with the vector- 
ization by directly combining the 2D bitmap data in 3D space and extracting 
manufacturing features from this generated data. 



2 Basic Idea 

In Fig. 1a,b a sample 3D part with three projections (left, top, front) is shown.
In these projections all edges are drawn as solid lines regardless of visibility.
The first stage of the fleshing-out-projections algorithm [6] is applied to these
projections to construct a so-called pseudo-wireframe, also shown in Fig. 1c. This
pseudo-wireframe is different from the wireframe of the original part because it 
contains some additional vertices and edges, which are discarded in later stages 
of the algorithm. But to the human observer there is a great amount of similarity 
between the part and the pseudo-wireframe. The basic structure and shape of 
the part is clearly visible. 

The basic idea behind the creation of the pseudo-wireframe is to combine the 
projections in 3D space to get candidates for vertices and edges. A somewhat 
similar technique can be applied also to the bitmap data instead of the vector 
data: Fig. 2a shows scanned pencil drawings of the three projections with some 
perturbations and changes in the quality of lines as they usually appear in CAD 
drawings. Now the following is done: The projections are mapped to the XY, XZ
and YZ planes of a coordinate system in a 3D lattice so that the projections are
properly aligned with each other. That means, for example, that the X coordinates
of the XY and XZ projections of a 3D point are the same. Now we create a 3D
voxel structure from these projections by adding the intensities of corresponding
projection image pixels (and normalizing the resulting intensity range to [0.0, 1.0]
through division by the maximum value afterwards).
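
The construction of the voxel structure can be written compactly with NumPy broadcasting. The sketch below assumes that the three projections are already aligned and given as arrays indexed as `xy[x, y]`, `xz[x, z]` and `yz[y, z]`; this indexing convention and the function name are choices made for the illustration only.

```python
import numpy as np

def voxels_from_projections(xy, xz, yz):
    """Add the intensities of corresponding projection pixels and normalize.

    xy[x, y], xz[x, z] and yz[y, z] are grey-value images of aligned projections.
    Returns an array vox[x, y, z] with values in [0, 1].
    """
    nx, ny = xy.shape
    nz = xz.shape[1]
    if xz.shape != (nx, nz) or yz.shape != (ny, nz):
        raise ValueError("projections are not aligned to a common lattice")
    # Each voxel (x, y, z) projects to xy[x, y], xz[x, z] and yz[y, z].
    vox = xy[:, :, None] + xz[:, None, :] + yz[None, :, :]
    return vox / vox.max()
```

For purely black-and-white input (line pixels 0, background 1) the result contains only the values 0, 1/3, 2/3 and 1, and the voxels with value 0 are exactly those whose three projections all fall on a line.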

Fig. 2c shows the resulting histogram of voxel intensities for our sample part 
(scaled in y-direction by factor 50). We distinguish four ranges: 0 to 0.5, 0.5 
to 0.7, 0.7 to 0.9 and 0.9 to 1. They can be interpreted as containing different 
elements of the voxel model: Suppose our images would only contain black lines 
Fig. 2. Three projections of a sample part scanned from paper, the resulting voxel
model (intensities from 0.0 to 0.45) and its histogram, scaled by factor 50 in y-direction. 



(intensity 0) on a white background (intensity 1). By adding the pixel intensities
and performing a normalization as described above we would get four different
voxel intensities: 0, 1/3, 2/3 and 1. Obviously we would have a small number of
voxels with intensity 0 and a very large number with intensity 1 (for this reason
the histogram has to be scaled to make the ranges visible). If we take the voxels 
from the first range, we get the structure shown in Fig. 2b. To get a better quality 
for later recognition tasks, here we took the range 0 to 0.45. Such a correction 
could easily be done manually by the user according to his visual impression. 

Obviously there is a remarkable similarity between the pseudo-wireframe
constructed above and the resulting voxel structure. Indeed the following can be
shown: If we have projections of a part, then two voxel sets can be obtained: The
first by creating the pseudo-wireframe and rasterizing its edges into a 3D voxel
space (intensities 0 and 1, painting a voxel black if it is intersected by a line), the
second by rasterizing the projections into 2D bitmaps and creating a 3D voxel
model as described above. If we extract the sets of black voxels $V_1$ and $V_2$ from
both models, then $V_1$ is a subset of $V_2$.

By using this method it should be possible to get 3D data to which real 3D 
features can be fitted without an intermediary vectorization which can introduce 
an additional loss of information. Besides this, the voxel model could give a per- 
son whose task is to make a 3D part from a drawing a first impression of the 
structure of the part and could be used as a component of a comfortable user 
interface to a feature recognition system working on this data. The user gets the
ability to interact with and guide the system supporting his work. This is useful
because it seems fairly certain that a fully automatic recognition/reconstruction
system cannot be built, due to the great variety of paper drawings and the
problems concerning the quality of drawings.

One practical problem should be mentioned here: For processing larger
scanned drawings a great amount of memory would be necessary if we
stored the voxels in a three-dimensional array. But this can be overcome by first
determining the interesting range from the histogram (which can be calculated 
without storing all voxels) and afterwards storing only those voxels which lie in 
this range in an appropriate data structure. 
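
One possible way to realize this two-step procedure is to process the voxel structure one slice at a time, so that the full array is never held in memory. The following sketch uses the same assumed indexing `xy[x, y]`, `xz[x, z]`, `yz[y, z]` as above and a plain dictionary as the "appropriate data structure"; both are illustrative choices, not the data structure of the prototype.

```python
import numpy as np

def voxel_slice(xy, xz, yz, x):
    """Unnormalized intensities of all voxels with a fixed x, shape (ny, nz)."""
    return xy[x][:, None] + xz[x][None, :] + yz

def intensity_histogram(xy, xz, yz, bins=100):
    """Histogram of normalized voxel intensities, computed slice by slice."""
    nx = xy.shape[0]
    vmax = max(voxel_slice(xy, xz, yz, x).max() for x in range(nx))
    hist = np.zeros(bins, dtype=np.int64)
    for x in range(nx):
        s = voxel_slice(xy, xz, yz, x) / vmax
        hist += np.histogram(s, bins=bins, range=(0.0, 1.0))[0]
    return hist, vmax

def voxels_in_range(xy, xz, yz, vmax, lo, hi):
    """Sparse storage: map (x, y, z) to the intensity v for lo <= v <= hi."""
    kept = {}
    for x in range(xy.shape[0]):
        s = voxel_slice(xy, xz, yz, x) / vmax
        ys, zs = np.nonzero((s >= lo) & (s <= hi))
        for y, z in zip(ys, zs):
            kept[(x, int(y), int(z))] = float(s[y, z])
    return kept
```

The histogram is inspected first (by the user or automatically) to choose `lo` and `hi`; only then is `voxels_in_range` called, so memory use is proportional to the number of voxels of interest plus one slice.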
3 Fitting Features Using Accumulation

To fit features in the 3D space we use an accumulation technique from image
processing. Its basic principle is known as the randomized Hough transform (RHT,
see [7]). The conventional Hough transform, as it is used for example to detect
lines in a 2D set of points, calculates and accumulates several parameter tuples
for every single point (in [7] this is called a diverging mapping). In contrast,
the RHT determines and accumulates one parameter tuple for chosen
sets of multiple points (converging mapping). Using this method the peaks in
the accumulator space become significantly sharper (for the above mentioned
case of detecting lines the RHT accumulator contains the squared values of the
Hough accumulator). Furthermore, in [7] it is shown that it is not necessary
to accumulate over all possible point subsets: even the accumulation over an
adequate number of randomly chosen subsets gives sharp peaks, which decreases
the running time.

The RHT method can be generalized to other objects which can be described 
by a set of parameters. The drawback is that if we have some more parameters 
we also need a larger accumulation space in which only small regions are really 
used. One possible solution to overcome this problem is doing the accumulation 
not in the accumulation space itself but in lower-dimensional projections of it. 

The drawback of this variant is the loss of information caused by the
projection to lower-dimensional parameter spaces. It can be very hard or even
impossible to find the right peaks in these projections, since there can be
pseudo-peaks which arise simply through adding all hits along a projection line
instead of being produced by existing objects.

Another solution to the above mentioned problem is a clustering technique:
Instead of using an accumulator array we maintain a list of clusters. Our
algorithm for this accumulation process (using the random selection of points) is
given in the following. It is controlled by three functions and three values (we
denote by $p$ a parameter tuple $(p_1, \ldots, p_k)$):

— A test function $t(p)$ with the return values “valid” and “invalid”, which
checks if a tuple $p$ is a valid parameter tuple for an object we search for,

— a quality function $\mu(p)$ which determines a properly defined quality measure
for the object determined by $p$,

— a distance function $d(p, q)$ which determines the distance of two parameter
tuples,

— a number $step_{max}$ which gives the maximum number of point tuples to be
chosen,

— a minimum quality measure $\mu_{min}$ which gives a minimum quality for an
object to be accumulated,

— a minimum distance $d_{min}$ to consider two parameter tuples to be different.

The algorithm gets a set of points as input and delivers a list of accumulated
parameter tuples (together with a weight and quality for each). The
quality function $\mu$ is of particular importance, since it constrains the number of
clusters being created during the accumulation process. Since for every parameter
tuple created for a randomly selected point set the distance to every existing
cluster has to be calculated, the running time of the algorithm would make it
unsuitable for practical application if there were no criterion to eliminate unusable
clusters. In the above example of accumulating lines, such a quality function
could count the number of points near the line described by two parameters.

1. Set $step = 0$.

2. Increment $step$ by 1. If $step > step_{max}$ the algorithm is completed.

3. Otherwise choose a tuple of $n$ points randomly and calculate the parameter
tuple $p$.

4. If the test function $t(p)$ yields “invalid” go to 2.

5. If the quality function $\mu(p)$ yields a value lower than the minimum quality
$\mu_{min}$ then go to 2.

6. If the list of accumulation clusters is empty then initialize the list with $p$ as
the first element (with weight 1) and go to 2.

7. Search the accumulation list for a tuple $q$ with a distance $d(p, q) < d_{min}$. If
there are several such tuples, choose one with a minimum value $d(p, q)$.

8. If such a tuple $q$ is found (we denote its weight with $w_q$) then replace it by
$q' = (w_q q + p) / (w_q + 1)$ with the new weight $w_{q'} = w_q + 1$ and go to 2.

9. If no such tuple is found then add $p$ with the weight 1 as a new tuple to the
accumulation list and continue with 2.
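
The nine steps above translate almost directly into code. The following Python sketch is one possible reading of them, not the authors' implementation: the helper `make_tuple`, which computes a parameter tuple from a point sample, is a hypothetical addition, points are drawn without repetition, and the quality stored for a merged cluster is simply recomputed from the merged tuple. The usage part applies it to the line example mentioned above, with a line described by slope and intercept.

```python
import random
import numpy as np

def list_accumulation(points, n, make_tuple, test, quality, distance,
                      step_max, mu_min, d_min, seed=0):
    """Cluster-list accumulation following steps 1 to 9.

    make_tuple(sample) -> parameter tuple or None   (hypothetical helper)
    test(p) -> bool, quality(p) -> float, distance(p, q) -> float
    Returns a list of [tuple, weight, quality] entries.
    """
    rng = random.Random(seed)
    clusters = []
    for _ in range(step_max):                     # steps 1 and 2
        p = make_tuple(rng.sample(points, n))     # step 3
        if p is None or not test(p):              # step 4
            continue
        p = np.asarray(p, dtype=float)
        mu = quality(p)
        if mu < mu_min:                           # step 5
            continue
        near = [(distance(p, c[0]), i) for i, c in enumerate(clusters)]
        near = [(d, i) for d, i in near if d < d_min]   # step 7
        if not near:                              # steps 6 and 9
            clusters.append([p, 1, mu])
            continue
        i = min(near)[1]                          # closest existing cluster
        q, w = clusters[i][0], clusters[i][1]     # step 8
        merged = (w * q + p) / (w + 1)
        clusters[i] = [merged, w + 1, quality(merged)]
    return clusters

# Usage: detect the line y = 2x + 1 among a few outliers.
pts = [(x, 2 * x + 1) for x in range(20)] + [(3, 40), (10, -7), (15, 2)]

def line_through(sample):
    (x1, y1), (x2, y2) = sample
    if x1 == x2:
        return None                      # vertical line: no (slope, intercept)
    m = (y2 - y1) / (x2 - x1)
    return (m, y1 - m * x1)

def line_quality(p):
    m, b = p
    return sum(abs(y - (m * x + b)) < 0.5 for x, y in pts) / len(pts)

clusters = list_accumulation(
    pts, n=2, make_tuple=line_through, test=lambda p: True,
    quality=line_quality,
    distance=lambda p, q: float(np.max(np.abs(p - q))),
    step_max=200, mu_min=0.5, d_min=0.1)
```

Because the quality threshold rejects every line that passes through an outlier, the cluster list stays short, which is the effect of the quality function described above.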

3.1 Boxes 

With our accumulation method not only geometric primitives but also complex
shapes can be found, for example axis-parallel rectangular boxes described by
six parameters $p = (x_{min}, x_{max}, y_{min}, y_{max}, z_{min}, z_{max})$. We search for boxes
the edges of which are contained in our voxel data. To achieve this we define the
quality function in an appropriate manner: If we have a box described by the
six parameters we divide every edge into a number of cells and check what
percentage of these cells contain data points (here we use 10 cells per edge). As
a distance between two tuples we simply use the maximum distance between
corresponding parameters:

\[ d(p, q) = \max_{i = 1, \ldots, 6} |p_i - q_i| . \]

To get the six parameters we simply choose n points and determine their 
bounding box. The choice of the number n influences the recognition quality 
and should depend on the kind of data we have: If there is a point set which 
consists only of one box (and our task is only to determine its parameters) we 
should choose a higher number of points to increase the chance to hit the whole 
box after a few iterations. But if there is additional noise in the data (a lot of 
separate points not being part of the box) we should choose a smaller n since 
the greater n is, the greater is the chance to choose some of the wrong points, 
so there will be a smaller chance to get the correct box. 

We carried out some experiments and found out that 4, 5, and 6 are suitable 
values for n for our purposes. In the following we used n = 6. 
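
The three ingredients for boxes (parameter tuple, quality and distance) can be sketched as follows. The voxel data is assumed to be a Python set of integer coordinate triples, and a cell counts as filled if one of a few sample positions inside it rounds to a data voxel; this sampling rule is an assumption of the sketch, since the text does not specify how cells are tested.

```python
import numpy as np

def bounding_box(points):
    """Parameter tuple (x_min, x_max, y_min, y_max, z_min, z_max) of n points."""
    pts = np.asarray(points)
    lo, hi = pts.min(axis=0).tolist(), pts.max(axis=0).tolist()
    return (lo[0], hi[0], lo[1], hi[1], lo[2], hi[2])

def box_distance(p, q):
    """Maximum absolute difference between corresponding parameters."""
    return max(abs(a - b) for a, b in zip(p, q))

def box_edges(p):
    """The 12 edges of the box as pairs of corner points."""
    x0, x1, y0, y1, z0, z1 = p
    edges = [((x0, y, z), (x1, y, z)) for y in (y0, y1) for z in (z0, z1)]
    edges += [((x, y0, z), (x, y1, z)) for x in (x0, x1) for z in (z0, z1)]
    edges += [((x, y, z0), (x, y, z1)) for x in (x0, x1) for y in (y0, y1)]
    return edges

def box_quality(p, voxels, cells=10, samples=5):
    """Fraction of edge cells that contain at least one data voxel."""
    hit = total = 0
    for a, b in box_edges(p):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        for k in range(cells):
            ts = np.linspace(k / cells, (k + 1) / cells, samples)
            cell = {tuple(int(v) for v in np.rint(a + t * (b - a))) for t in ts}
            hit += bool(cell & voxels)
            total += 1
    return hit / total
```

In the accumulation, `bounding_box` plays the role of the parameter calculation in step 3, `box_quality` that of $\mu$ and `box_distance` that of $d$.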







F. Ditrich, H. Suesse, and K. Voss 



Besides the presence of noise, another problem is the correct recognition of
several objects. If we have a scene with $k$ boxes the probability of choosing $n$
points belonging to the same box is $(1/k)^n$ (if choosing the same point multiple
times is allowed), which significantly decreases with increasing $k$. Nevertheless
it makes sense to apply our method to the voxel data: We do not try to find all 
boxes contained in the data automatically but give the user the chance to direct 
the recognition process. In our software prototype it is possible to select a region 
of voxels to be used for recognition. Here we use the fact that a user is able 
to get a principal impression of the model structure and can select interesting 
regions containing boxes. After the accumulation process is done the found boxes 
are presented to the user and he can iterate through them (they can be sorted 
by their qualities or the weights of their clusters) and choose the right ones. A 
sample is shown in Fig. 3a. 

3.2 Cylinders 

Another class of objects which can be found using the accumulation method
are axis-parallel cylinders. They appear in the voxel structure as two circles on
parallel planes (assuming the planar faces of the part are also axis-parallel). To
describe them we use six parameters: The coordinate direction which is parallel
to the axis of the cylinder ($p_1$), two coordinates to describe the position of the
axis in a plane perpendicular to the axis direction ($p_2$, $p_3$), the radius ($p_4$) and
two values for the coordinate range the cylinder covers along the axis direction
($p_5$, $p_6$). Of course we have to slightly modify our list accumulation algorithm
to handle the first parameter correctly: It is a flag describing one of three
directions, so the new cluster cannot be calculated as described above; instead our
distance function must guarantee that tuples with different directions are never
collected in the same cluster. Here the distance between two parameter tuples
$p = (p_1, \ldots, p_6)$ and $q = (q_1, \ldots, q_6)$ is defined as

\[ d(p, q) = \begin{cases} \infty & \text{if } p_1 \neq q_1 \\ \max_{i = 2, \ldots, 6} |p_i - q_i| & \text{otherwise.} \end{cases} \]

For the calculation of parameter tuples we choose three points and calculate
one possible tuple for each of the three possible axis directions. Once the axis
direction is given, it is easy to determine the position of the axis and the radius.
As the covered range we use the appropriate minimum and maximum coordinate
values of the three points.

To determine the quality of a parameter tuple we also divide the two circles
into cells and count what percentage of them contain points (in our prototype
we use 20 cells for each).
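
The parameter calculation for cylinders can be sketched as follows. For a given axis direction the three points are projected along that axis and the circle through the projected points is computed; axis directions are encoded as 0, 1, 2 for x, y, z, which is a convention of this sketch, as are the function names.

```python
import math

def circle_through(a, b, c):
    """Centre and radius of the circle through three 2D points, or None if collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, math.hypot(ax - ux, ay - uy)

def cylinder_tuples(p1, p2, p3):
    """One candidate (axis, u, v, radius, lo, hi) per possible axis direction."""
    tuples = []
    for axis in range(3):
        keep = [i for i in range(3) if i != axis]
        circ = circle_through(*[(p[keep[0]], p[keep[1]]) for p in (p1, p2, p3)])
        if circ is None:
            continue                      # projected points are collinear
        along = [p[axis] for p in (p1, p2, p3)]
        tuples.append((axis, circ[0], circ[1], circ[2], min(along), max(along)))
    return tuples

def cylinder_distance(p, q):
    """Infinite for different axis directions, else the maximum parameter difference."""
    if p[0] != q[0]:
        return math.inf
    return max(abs(a - b) for a, b in zip(p[1:], q[1:]))
```

Since the distance is infinite for tuples with different axis flags, such tuples never fall below $d_{min}$ and are therefore never merged, so the averaging in step 8 is only ever applied to the five continuous parameters.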

3.3 Extrusions 

The accumulation can be used not only to detect shapes but also to find
transformations between point sets. This can be used for example to find features
Fig. 3. A recognized box in a selected subset of the voxels from Fig. 2b (see Section 3.1),
the accumulator content after accumulating translations for an axis-parallel box and a 
recognized translation (see Section 3.3). 



which are created by a translational sweep of a profile (usually called extrusion 
or extrusion feature). To find the corresponding translations we use a three- 
dimensional accumulator and accumulate for every pair of distinct points the 
translation which maps one point onto the other (we do not choose the point 
pairs randomly). If the voxel structure contains profiles which are “connected”
through a translation we should get corresponding peaks in the accumulator. 

But there is one problem: The edges contained in the voxel structure would 
also produce peaks since there are a lot of pairs with points lying on the same 
edge with the same distance. To solve this problem we should check if there 
are a lot of points near the line connecting the two points. An appropriate data 
structure can easily be built from the voxel set in a preprocessing step. 

The list accumulation described above is not well suited for this purpose,
since a large number of clusters would be generated, which increases the running
time of the algorithm.

Fig. 3b shows the content of an accumulator for an axis-parallel box; the size
of the cubes illustrates the number of hits in the cells. As expected there are
prominent peaks for the six translations connecting the three pairs of rectangles 
forming the boundary of the box. In Fig. 3c the sample part with a detected 
translation is shown. Since arbitrary translations are allowed there are a lot of 
translations besides the six most prominent which have also a significant number 
of hits in the accumulator. 

Of course we can restrict the search to axis-parallel translations, which could
be found using three one-dimensional accumulators and would reduce the set of
pairs which need to be checked.

The computational effort caused by checking all pairs can also be decreased by
uniformly thinning the voxel set, if the resulting accuracy is sufficient.

If translations are detected, then for every translation the two sets of points
which are mapped onto each other by this translation can be built and visualized
in the user interface (see Fig. 3c). The sets can be further processed, for
example by determining a supporting plane and extracting contours which can be
used as a base for an extrusion feature.



4 Conclusions and Further Work 

Finally we would like to mention some problems and give some prospects for
further work. One very important problem which needs to be solved is the correct
alignment of the three views before adding their intensities to get the voxel data.
Perhaps here also a semi-automatic solution involving user interaction can be 
found. Of course it is possible to extend the set of objects which can be found us- 
ing the accumulation technique described above. The correct placement of cross 
sections within the voxel data is another extension to our paradigm. A further 
important task is the automatic selection of interesting voxel subsets on which 
the accumulation procedure will be applied. Currently this has to be done by 
the user. A first simple solution could be the search for connected components in 
the voxel data, since holes appear in most cases separated from the other edges 
of the part. Ideally the system would be able to find those regions automati- 
cally and to extract the contained features. Then all possible combinations of 
the found features using boolean operations could be built to create the possible 
parts. Perhaps this set could be constrained using geometric properties like for 
example the fact that an isolated cylinder must always be a hole. From each 
of these generated parts the 2D views could be derived and compared to the 
original ones using an appropriate similarity measure. 

The relationship and degree of similarity between the voxel model and the
pseudo-wireframe resp. between the sets $V_1$ and $V_2$ from Section 2 is also an
interesting subject for further theoretical investigation.

Abstract. A common approach to 3-D reconstruction from image sequences
is to track point features through the images, followed by an estimation
of camera parameters and scene geometry. For long sequences,
the latter is done by applying a factorization method followed by an
image-by-image calibration. In this contribution we propose to integrate
the tracking and calibration steps and to feed back already known camera
parameters to both tracking and calibration. For loop-like camera
motion, reconstruction can thus be optimized by using loop-closing
algorithms known from robot navigation.



1 Introduction 

Reconstructing 3-D scene geometry and camera parameters from a sequence of 
images is a common problem in many computer vision applications. One of these 
applications, for which the approach described in the following was developed, 
is the computation of light fields [5]. The light field is an image-based scene 
model where a set of original images is used to render new views of a scene from 
arbitrary camera positions. Besides image data and geometry information, light
fields require very accurately determined camera parameters for good rendering
results.

If no information is available about the camera pose and internal parame- 
ters they are estimated by so-called structure-from-motion approaches [7]. Using 
feature detection and tracking algorithms point correspondences are established 
between the images of a sequence. These are used by a factorization algorithm 
to simultaneously determine the scene geometry (structure) and camera poses 
(motion) of multiple images. Usually there are not enough point correspondences 
to process the whole image sequence at once using one factorization, therefore 
the camera parameters of the rest of the sequence are computed image by image 
using camera calibration methods. This approach is described in detail in [3]. 

The main problem arising during this extension process is that, though errors
may be small from one image to the next, they accumulate over a large number of
images, leading to inconsistencies in the geometry reconstruction. In the following
we will consider the case that a hand-held camera is moved in loops around a
scene, e.g. to view an object from every direction or to get a dense sampling. The
approach we will introduce was inspired by solutions in the field of simultaneous
localization and mapping (SLAM) for robot navigation. Here, the goal is to
generate a globally consistent map of the surroundings of a robot [6], while
the data from the robot’s sensors, e.g. odometry and a camera, are unreliable.
Consistency of the map can be established when the robot returns to a previous
position and recognizes landmarks it has seen before. The accumulated error
can then be determined and the rest of the map corrected accordingly. For the
case of 3-D reconstruction we will now use the occurrence of a loop in camera
movement to update the pose of all previous cameras in the loop. The error
introduced by this process is reduced by bundle adjustment.

* This work was funded by the German Research Foundation (DFG) under grant SFB
603/TP C2. Only the authors are responsible for the content.

C.E. Rasmussen et al. (Eds.): DAGM 2004, LNCS 3175, pp. 471-479, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

I. Scholz and H. Niemann

The idea of using topology information to improve reconstruction was implemented
before in [4], where a zigzag motion of the camera was utilized to track a
feature in an increased number of images. In [1] the accumulated reconstruction
error of a turntable image sequence is distributed to all camera position
estimates by aligning several sub-sequences. A similar distribution of errors is
done in [9] for image mosaics, although in this case the camera motion is
constrained to rotations only.

A description of the linear, integrated structure-from-motion approach of 
tracking, factorization and frame-wise extension will be given in Section 2. The 
closing of loops by information feedback and optimization is the topic of Section 
3, and its experimental evaluation is described in Section 4. A summary and 
outlook to the future are given in the conclusion. 

2 Linear Calibration Process 

The usual processing chain for a 3-D reconstruction of a scene is to first gener- 
ate the required point correspondences for all images followed by the respective 
algorithms for structure-from-motion. In the work at hand we want to demon- 
strate the usefulness of feeding back information from the calibration step to 
the tracking and subsequent calibration. Therefore, tracking and calibration are 
first integrated into a linear processing chain as shown in Figure 1. 

First, feature tracking is done until the number of tracked points reaches a 
lower bound and a factorization is performed for the images so far. In the second 
loop the features are tracked to the subsequent images and a camera calibration 
is applied for each. Thus the camera movement and 3-D points are recovered 
image by image. Last, the reconstruction is optimized by bundle adjustment on 
all camera positions and points. 

The individual steps of this linear processing chain will be described in more 
detail in the following, whereas the extension to an iterative process, including 
information feedback, will be introduced in Section 3. 

2.1 Feature Detection and Tracking 

In order to get accurate point correspondences over a large number of images,
feature detection and tracking are performed using the gradient-based algorithm
by Tomasi and Kanade [11] and the extension by Shi [8]. In the latter, robustness
is increased by considering affine transformations for each feature window.

    Initialize frame number: i := 0
    REPEAT
        Track point features to frame i and detect new ones
        i := i + 1
    UNTIL min. number of features visible in all frames reached or i = N - 1
    Apply factorization method to first i frames
    WHILE i < N
        Track point features to frame i and detect new ones
        Triangulate 3-D points and calibrate frame i
        i := i + 1
    Apply bundle adjustment to all frames and 3-D points

Fig. 1. Linear tracking and calibration over N images in two steps: factorization of
initial subsequence and calibration of subsequent images

This procedure has been further augmented by a hierarchical approach which
computes a Gaussian resolution pyramid for each image, thus increasing the
maximum disparity allowed between two images. A final improvement incorporates
illumination compensation, which solves many problems occurring in
environments which are not specially lit [14].



2.2 Factorization and Calibration Extension 

For the images in the first block of Figure 1, structure and motion in the
sequence are recovered using a factorization method assuming weak-perspective
projection [7]. It yields the camera pose parameters for a set of images and the
3-D position of each feature visible in every image. In order to gain a
perspective reconstruction of the camera poses, perspective projection matrices
are constructed from the result of the preceding factorization. Since the
intrinsic parameters are unknown, the principal point is assumed to be in the
image center. For the focal length a rough approximation of the correct one is
chosen as described in [3]. Camera parameters and 3-D points are then optimized
using the Levenberg-Marquardt algorithm, minimizing the back-projection error.
Intrinsic parameters are assumed to be constant, which results in a small but
acceptable error due to the wrongly estimated focal length.

Once this initial reconstruction of the first subsequence is available, it can
be used as a calibration pattern for calibrating the subsequent images. Features
which are visible in the next image to be calibrated but whose 3-D positions are
not yet available are triangulated using their projections in the already
calibrated images. With these correspondences the camera position can be
estimated using common calibration algorithms [12], and the result is optimized
again by minimizing the back-projection error. In fact this optimization is
accurate enough that, for small camera movements, it can be initialized with the
position of the last camera and the calibration step can be omitted entirely.

Fig. 2. Example reconstruction of a camera path around an object. Correct camera
positions are denoted by dotted triangles, erroneous ones by solid triangles.
(Labels in the figure: 3-D Object, Starting Camera, Correct Camera, Erroneous
Camera, Total Error.)



2.3 Bundle Adjustment 

The optimization of the camera parameters and 3-D points in the steps before 
was always done for one camera after another and in turn with the point posi- 
tions. In contrast to that the idea of bundle adjustment is to optimize all these 
parameters at once to reduce the back-projection error globally. This straight- 
forward approach, as used in [2] for scene reconstruction, has the disadvantage 
of a very large parameter space to be optimized. Therefore the less complex 
interleaved bundle adjustment [10] is used in the following. 

Bundle adjustment is usually applied to an image sequence as a whole. For 
long sequences with more than 100 images it is very time consuming, especially 
if it is repeated every m images as explained later in Section 3.3. Therefore 
the method was adapted to support the optimization of only a few cameras at 
a time. The camera positions in such a subsequence are optimized jointly but 
without considering the rest of the sequence, while the 3-D points are optimized 
considering all cameras. Thus, back-projection error is only slightly increased for 
cameras outside the subsequence, while it is improved for those inside. 



3 Feedback Loop 

The main problem of the linear calibration process described in Section 2 is that
small errors from one frame to the next accumulate over time and may thus lead
to serious displacements of the camera positions. This is demonstrated in Figure
2, where a camera moves in a circle around an object taking 10 images in the
process. The correct camera positions are equally spaced around the object, but
an error of only about four degrees from each camera to the next adds up to
more than 35 degrees. In order to get a correct reconstruction the circle must
be closed again by removing this inconsistency. This situation is equivalent to a
robot moving in a loop through some complex environment, and the approach
introduced in the following is used similarly for mapping the robot’s environment.

Fig. 3. (a) Linear, erroneous reconstruction. (b) Loop closed without considering
rotation. (c) Loop closed considering rotation. (d) Final, optimized reconstruction.

3.1 Closing Loops 

Although a hand-held camera may be moved back to any earlier position, we
assume here that $N$ camera positions form a loop and that camera 0 follows
again on camera $N-1$. In contrast to the linear calibration process used
before, features are now tracked from image $N-1$ to image 0, thus establishing
a relationship between the two images. By applying the extension step of Section
2.2, the displacement between the last and the first camera position,
$\Delta\hat{t}_{N-1}$, is calculated with a much higher accuracy than before,
when the accumulated error was included. Going back to the image of a robot,
this is the equivalent of recognizing a formerly seen landmark.

Using this new information the task of closing the loop is again formulated
as an optimization process. For now, only the translation vector of each camera,
$t_n$, is considered. The displacement vector between two cameras is denoted by
$\Delta t_n = \Delta\hat{t}_n = t_{n+1} - t_n$ for $0 \le n < N-1$. Additionally,
$\Delta t_{N-1}$ constitutes the current displacement vector between last and
first camera, while $\Delta\hat{t}_{N-1}$ is the corresponding target
displacement calculated above. Thus, for $0 \le n < N$ the $\Delta\hat{t}_n$
form the desired set of displacements while the $\Delta t_n$ are the
displacements to be optimized. The residual vector is defined as

$$e = \left( (\Delta\hat{t}_0 - \Delta t_0)^T, (\Delta\hat{t}_1 - \Delta t_1)^T, \ldots, (\Delta\hat{t}_{N-1} - \Delta t_{N-1})^T \right)^T \qquad (1)$$

and using the Levenberg-Marquardt algorithm the camera positions $t_n$, $n > 0$,
are optimized by minimizing the residual $e^T e$. The first camera position
$t_0$ is kept unchanged.
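As an illustration of what this minimization does when only translations are considered, the following sketch closes a loop in pure Python. It is not the implementation used here: because the residual is linear in the camera positions, the sketch replaces Levenberg-Marquardt by the equivalent closed-form least-squares solution, which subtracts the mean of the target displacements from each of them. The function name `close_loop` and the argument `target_last` (the re-measured displacement from the last camera back to the first) are assumptions made for the example.

```python
def close_loop(positions, target_last):
    """Distribute the loop-closure error evenly over all displacements.

    positions   -- camera centres t_0 .. t_{N-1} of the linear reconstruction
    target_last -- target displacement from camera N-1 back to camera 0
    Returns new positions; t_0 is kept unchanged.
    """
    n = len(positions)
    # Target displacements: the existing ones plus the re-measured closing one.
    targets = [tuple(b[k] - a[k] for k in range(3))
               for a, b in zip(positions, positions[1:])]
    targets.append(tuple(target_last))
    # Around a closed loop the displacements must sum to zero, so the
    # least-squares solution removes the mean from every target displacement.
    mean = [sum(d[k] for d in targets) / n for k in range(3)]
    closed = [tuple(positions[0])]
    for d in targets[:-1]:
        prev = closed[-1]
        closed.append(tuple(prev[k] + d[k] - mean[k] for k in range(3)))
    return closed


# Four cameras on a unit square whose reconstruction has drifted in x.
drifted = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (1.2, 1.0, 0.0), (0.3, 1.0, 0.0)]
closed = close_loop(drifted, (0.0, -1.0, 0.0))
```

Every displacement, including the closing one, then deviates from its target by the same vector, which is the total loop error divided by the number of cameras.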

The result of an erroneous, linear reconstruction of an example sequence is
shown in Figure 3(a). Here, an object was placed on a turntable and rotated
in 40 steps with one image taken for each. Applying the optimization above for
closing this circle yields the reconstruction of Figure 3(b), which is obviously not
satisfactory. The rotations between the displacement vectors $\Delta\hat{t}_n$
do not sum up to a full circle, therefore the optimization does not yield a
circle either.

The solution is to incorporate the rotation that is missing to a full circle
into the computation of the residual vector. This rotation is calculated as the
rotation difference between the last and the first camera pose,
$\Delta R = R_0 R_{N-1}^{-1}$. Lacking any other knowledge we assume that the
$\frac{1}{N}$th part of this rotation, $\Delta R_n$, is missing in each
displacement vector. $\Delta R_n$ is computed using spherical linear
interpolation [13] on a quaternion representation of $\Delta R$. Thus the new
displacement vectors are computed as

$$\Delta\hat{t}_n = \Delta R_n R_n (t_{n+1} - t_n). \qquad (2)$$

Using these new target displacement vectors $\Delta\hat{t}_n$ the result
improves to that of Figure 3(c). The new camera positions were also rotated by
$\Delta R_n$ so that they now face in approximately the correct direction.
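Taking a fixed fraction of a rotation by spherical linear interpolation can be sketched as follows. The sketch assumes unit quaternions stored as (w, x, y, z) tuples and interpolates between the identity rotation and the given rotation. The helper names `fraction_of_rotation` and `quat_mul` are hypothetical, and the numbers are illustrative only.

```python
import math


def fraction_of_rotation(q, t):
    """Slerp from the identity to the unit quaternion q = (w, x, y, z):
    same axis, rotation angle scaled by t."""
    w, x, y, z = q
    if w < 0.0:                      # use the shorter of the two arcs
        w, x, y, z = -w, -x, -y, -z
    s = math.sqrt(x * x + y * y + z * z)
    if s < 1e-12:                    # q is (numerically) the identity
        return (1.0, 0.0, 0.0, 0.0)
    half_angle = math.atan2(s, w)
    scale = math.sin(half_angle * t) / s
    return (math.cos(half_angle * t), x * scale, y * scale, z * scale)


def quat_mul(a, b):
    """Hamilton product of two quaternions."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


# A missing rotation of 36 degrees about the z axis, split into 9 equal parts.
missing = (math.cos(math.radians(18)), 0.0, 0.0, math.sin(math.radians(18)))
part = fraction_of_rotation(missing, 1.0 / 9.0)
```

Here `part` is a rotation of 4 degrees about the same axis, and applying it nine times reproduces the full missing rotation.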

Usually an image sequence does not consist of exactly one revolution around 
an object. More circular camera movements may follow the first one, and in such 
cases it is not desired to change the camera positions in a loop already closed 
before. From there on, the position of a camera once adjusted is kept untouched, 
and the algorithm above is only applied to later cameras. 

3.2 Optimizing Reconstruction 

Changing the camera positions renders the 3-D point positions invalid, as seen 
in Figures 3(b) and 3(c), and they have to be recalculated. This is done by again 
minimizing the back-projection error during an optimization of the 3-D points. 

Finally the result is again optimized globally using bundle adjustment as 
described in Section 2.3. The intrinsic parameters are assumed to be correct and 
bundle adjustment is only applied for the extrinsic parameters. The final result 
of such an optimization is shown in Figure 3(d). If only some cameras of a loop 
were adjusted in the closing step before, only those are optimized now, too. 

3.3 Finding Loops 

In a common application such as scene reconstruction from the images of a
hand-held camera it is not known when a camera loop has been completed and the
closing algorithm should be applied. The example of Figure 3 of an object on
a turntable thus constitutes a special case, since the end of the circle is known
beforehand. For the general case a simple comparison scheme is used. A camera
position is a neighbour of the current camera if its distance to the current
camera is smaller than k times the average distance between two consecutive
camera positions and it is not one of the last m positions; k and m are
user-defined values. In order to ensure that the corresponding images show
approximately the same part of the scene, a maximum viewing direction
difference can additionally be defined.
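The neighbour test can be sketched in a few lines. This is only an illustration under stated assumptions: camera positions are 3-D tuples, the last entry of the list is the current camera, the optional viewing-direction check is omitted, and the function name `find_loop_neighbours` is hypothetical. At least two positions are required.

```python
import math


def find_loop_neighbours(positions, k, m):
    """Return indices of earlier cameras that are neighbours of the last one.

    A camera is a neighbour if it is closer than k times the average distance
    between consecutive cameras and is not among the m positions directly
    preceding the current camera.
    """
    current = positions[-1]
    steps = [math.dist(a, b) for a, b in zip(positions, positions[1:])]
    threshold = k * sum(steps) / len(steps)
    first_excluded = len(positions) - 1 - m
    return [i for i in range(max(first_excluded, 0))
            if math.dist(positions[i], current) < threshold]


# Twelve cameras per revolution on a unit circle; camera 12 returns to the start.
circle = [(math.cos(2 * math.pi * i / 12), math.sin(2 * math.pi * i / 12), 0.0)
          for i in range(13)]
neighbours = find_loop_neighbours(circle, k=1.5, m=3)
```

With these values the cameras at indices 0 and 1 are reported as neighbours of the returning camera, whereas after only half a revolution no neighbour is found.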

An unsolved problem of this method is that large displacements, as in the
example above, are not detected, whereas closing a loop is the more worthwhile
the larger the accumulated error is. This contradiction will be exemplified in
the experiments in Section 4.

4 Experiments 

Measuring the accuracy of a structure-from-motion reconstruction is a difficult
problem, especially for real scenes. The back-projection error is often used as
a measure, but it depends highly on the quality of feature points, and a low
back-projection error may still not give a satisfactory result.

Fig. 4. Reconstruction of the Santa Claus image sequence: (a) linear reconstruction,
(b) only bundle adjustment on loops, (c) loops closed and bundle adjustment.

Table 1. Back-projection errors and camera position differences for the two example
image sequences

             back-projection error [pixel]            position difference
             linear rec  only bundle  close+bundle    linear rec  only bundle  close+bundle
Sequence 1   1.28        1.75         2.06            11.7        11.5         5.18
Sequence 2   1.15        2.96         3.61            13.5        7.49         8.34

Given ground-truth data for the camera positions a direct comparison to 
the reconstruction is possible and more meaningful. Therefore, two example se- 
quences were chosen of an object being placed on a turntable and with a camera 
mounted on a robot arm above the table. Sequence 1 was already shown in Fig- 
ure 3. It consists of 40 images of a coke can, taken during one revolution of the 
turntable. Sequence 2 was taken from a Santa Claus figure with five revolutions 
of the turntable and 40 images each, where the robot arm was moved upward on 
a circle by 3 degrees after each revolution. The result of only a linear calibration 
is shown in Figure 4(a). For the improved calibration loops were detected auto- 
matically every tenth image after reconstruction of the first revolution, yielding 
the much improved results of Figures 4(b) and 4(c). 

For comparison, the ideal camera positions were calculated from the turntable 
and robot arm positions. The reconstruction differs from the ideal one by a 
rotation, translation and scale factor. Using axis-angle notation for the rotation 
the 7 parameters of this transformation are estimated using (again) Levenberg- 
Marquardt to optimally register the two reconstructions with each other. The 
error value for the camera positions is calculated as the average distance of two 
corresponding cameras. 

As mentioned before in Section 3.3, closing loops is the more worthwhile
the larger the accumulated error is. This issue is reflected in the experimental
results of Table 1. Both the average back-projection errors and camera position
differences are given for the reconstruction using only bundle adjustment on
identified loops and for the whole process of closing loops of Section 3.1. The
linear reconstruction of sequence 1 has a large accumulated error; therefore
closing loops has a great effect on the position difference, while just applying
bundle adjustment is insufficient to reduce this error. For sequence 2, on the
other hand, the accumulated error is rather low (the gap visible in Figure 4(b))
and thus, although this gap is closed for the reconstruction with closing in
Figure 4(c), the camera position difference is still lower without the closing
step. The inaccuracies introduced by closing, represented by the increased
back-projection error in both sequences, were not compensated sufficiently by
bundle adjustment.



5 Conclusion 

In this contribution we proposed a method for creating a globally consistent 
scene reconstruction from an image sequence of a hand-held camera. Loops in 
the movement of the camera are detected and the accumulated error due to the 
linear calibration process is compensated by closing this loop. This approach 
is used similarly in robot navigation for simultaneous localization and mapping 
(SLAM). The results of each loop are optimized by bundle adjustment. 

Since the closing introduces some error on each camera position, it works
well for the compensation of large errors, but for small displacements using only
bundle adjustment may yield better results. Thus the main issues for future
work are the identification of loops despite large errors and the reduction of
errors introduced during the closing process.

Abstract. We present a probabilistic framework for matching of point
clouds. Variants of the ICP algorithm typically pair points to points or
points to lines. Instead, we pair data points to probability functions that
are thought of as having generated the data points. Then an energy function
is derived from a maximum likelihood formulation. Each such distribution
is a mixture of a bivariate Normal Distribution to capture the local
structure of points and an explicit outlier term to achieve robustness. We
apply our approach to the SLAM problem in robotics using a 2D laser
range scanner.



1 Introduction 

Matching point clouds is an important problem in computer vision and robotics.
For example, 3D range scans taken from different positions have to be integrated
to build 3D models. In this case a six-vector of parameters has to be estimated.
The case of 2D point clouds with three parameters to be estimated (a 2D
translation and a rotation) is significant for the simultaneous localization and
mapping (SLAM) problem in robotics when using a 2D laser range scanner as sensor.
Such a device measures range values of a 2D slice of the environment and is a
common input sensor for mobile platforms. Scenarios in large buildings, where
long cycles have to be closed, require both accurate scan matching results and
good estimates of uncertainty to distribute accumulated errors accordingly. At
the same time the scan matcher should be robust, because real environments
tend to be non-static: People are moving, doors are opening and closing and so
on. Such events give rise to outliers, and the better the scan matcher tolerates
them the better it can be used for mapping and localization.

This paper focuses on 2D point clouds with a Euclidean 2D transformation
to be estimated. Nevertheless, we claim that our view of the problem is also 
relevant for other cases, for example for the full 3D case. The emphasis is on 
the probabilistic framework that is used to derive our scan matching algorithm. 
In previous work we have shown experimentally the capability of our algorithm 
to build accurate maps in realtime using range scan matching [2]. Here we give 
both a theoretical justification and improvements. 

2 Previous Work 

The standard approach to the rigid registration problem is the Iterated Closest
Point (ICP) algorithm or variants thereof. The name ICP was introduced in the
seminal paper by Besl and McKay [1], but similar ideas were developed
independently at the same time by Zhang [21] and Chen and Medioni [5]. We will
now repeat this algorithm briefly and thereby introduce the notation utilized in
the rest of the paper. Although the role of the two point clouds to be matched
is exchangeable, we speak of data points which are to be registered with a model
or model points. Let $x_i$ be a data point and assume that an initial estimate
of the parameters is given. Then map each $x_i$ according to these parameters
into the coordinate frame of the model. Without explicitly noting it, the $x_i$
are always thought of as being mapped by the current parameter estimate in the
rest of the paper, that is, $x_i$ is in effect a function of the transformation
parameters. The ICP algorithm now iterates two subsequent steps:

1. Correspondence: For each $x_i$, find the closest point $m_i$ of the model.

2. Estimate new parameters, such that the sum of squared distances between
each $x_i$ and $m_i$ pair is minimized. Update the $x_i$ according to the new
parameters.

There is a closed form solution for the second step. Several researchers noted
that the convergence properties of this point-to-point approach are poor and that
point-to-plane correspondences (as used by Chen [5]) perform much better (e.g.
[17]). In the field of mobile robotics Lu and Milios estimated normals from the
scan [13] and in this way incorporated a kind of point-to-line metric. Robust
methods typically consist of leaving out points that have too large residuals.
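The two ICP steps can be sketched for the 2D point-to-point case as follows. This is a minimal illustration, not the algorithm proposed in this paper: it assumes a brute-force nearest-neighbour search, a fixed number of iterations, and the standard centroid-and-angle closed form for the 2D rigid least-squares problem. All function names are hypothetical.

```python
import math


def best_rigid_2d(data, model):
    """Closed-form rotation angle and translation minimizing the sum of
    squared distances between paired 2D points (data[i] -> model[i])."""
    n = len(data)
    cdx, cdy = sum(p[0] for p in data) / n, sum(p[1] for p in data) / n
    cmx, cmy = sum(p[0] for p in model) / n, sum(p[1] for p in model) / n
    s_dot = s_cross = 0.0
    for (px, py), (qx, qy) in zip(data, model):
        ax, ay, bx, by = px - cdx, py - cdy, qx - cmx, qy - cmy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, cmx - (c * cdx - s * cdy), cmy - (s * cdx + c * cdy)


def transform(points, theta, tx, ty):
    """Rotate 2D points by theta and translate them by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


def icp_2d(data, model, iterations=20):
    """Iterate correspondence search and parameter estimation."""
    current = list(data)
    for _ in range(iterations):
        # Step 1: pair every data point with its closest model point.
        matches = [min(model, key=lambda m: math.dist(p, m)) for p in current]
        # Step 2: estimate the rigid motion and update the data points.
        theta, tx, ty = best_rigid_2d(current, matches)
        current = transform(current, theta, tx, ty)
    return current


# An L-shaped model and a slightly rotated and shifted copy of it as data.
model = [(float(x), 0.0) for x in range(5)] + [(0.0, float(y)) for y in range(1, 4)]
data = transform(model, 0.05, 0.1, -0.05)
aligned = icp_2d(data, model)
```

In this example the initial misalignment is small enough that the nearest-neighbour pairing is already correct, so the data points are mapped back onto the model.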

Fitzgibbon [9] proposed to replace the closed-form solution by a non-linear
energy function and to use the Levenberg-Marquardt (LM) algorithm to minimize
it. This allowed him to incorporate a robust kernel, as it is used in M-Estimators
[16, 19]. The immediate problem here is the calculation of first derivatives that
are needed by the LM algorithm. His solution is to calculate distances to closest
points on a regular grid (instead of the closest points themselves). This way, the
spatial derivatives can be calculated numerically at each grid point and
interpolated in between. Interestingly, a similar approach using even an octree
(but without using robust kernels) was already proposed in 1992 [4]. Our solution
is similar in that we use a regular spatial grid. But the information at each grid
point is much more sophisticated in that it also contains a model of the local
environment going beyond simple interpolation. For optimization we use Newton’s
algorithm instead of LM. This algorithm has better convergence properties at
the cost of requiring “real” second derivatives (compared to the Gauss-Newton
approximation of the Hessian used in LM). One benefit of our method is that
these can be calculated analytically.

3 A Probabilistic Approach 

This section presents a probabilistic interpretation of the ICP algorithm,
constituting the basis for the design of a robust energy function that captures
the local structure of points. As mentioned earlier, the correspondence step pairs
each point $x_i$ to a model point $m_i$. Assuming that the locations of points
which correspond to $m_i$ were generated by a normally distributed random process
with parameters $m_i$ and $\sigma_i$, the likelihood of having measured $x_i$ is:

$$p(x_i) \propto \exp\left(-\frac{d_i^2}{2\sigma_i^2}\right) \qquad (1)$$

with $d_i = |x_i - m_i|$. So the Normal Distribution can be interpreted
as a generative process for $x_i$: It is assumed that the location of $x_i$ has
been generated by drawing from this distribution.

Now the problem of estimating the transformation parameters can be formulated
as a Maximum Likelihood problem: The parameters are optimal if the
resulting transformed data points $x_i$ maximize the following likelihood function:

$$\Psi = \prod_i p(x_i) \propto \prod_i \exp\left(-\frac{d_i^2}{2\sigma_i^2}\right) \qquad (2)$$

Equivalently the negative log-likelihood of $\Psi$ can be minimized:

$$-\log \Psi = \sum_i \frac{d_i^2}{2\sigma_i^2} + \mathrm{const} \qquad (3)$$

If the variances are equal for each $i$ this is the energy function that is
minimized by ICP. In a similar manner a point-to-line measure can be formulated.
Then $d_i$ denotes the distance to the line. Only the domain has to be limited
to a finite range; otherwise the integral over the probability function is
unbounded and it cannot be normalized.

Fig. 1(a) and (b) show the underlying probability density functions (pdfs).
Both can be generalized by a bivariate Normal Distribution:

$$p(x) \propto \exp\left(-\tfrac{1}{2}(x - m)^T C^{-1} (x - m)\right) \qquad (4)$$

where $C$ is a symmetric $2 \times 2$ matrix. If $C^{-1}$ has two equal
eigenvalues, it becomes the Point-to-Point measure. If one eigenvalue is near
zero, the pdf becomes the Point-To-Line measure. Figure 1(d) illustrates such a
distribution and how it can be understood as a product of two univariate Normal
Distributions.

Our view is now as follows: In the correspondence step each data point is 
assigned a generative process in the form of a pdf that is considered to have 
generated its coordinates. This probability function is not necessarily restricted 
to be a Normal Distribution, any pdf that captures the structure of points locally 
well is a candidate pdf. 

Now we deal with how to choose a good probability density function: The pdf
should be able to approximate common structures (like lines) well and should at
the same time be robust against outliers. This paper proposes to use a mixture
of a bivariate Normal Distribution and a uniform distribution. That mixture
reads:

$$p(x) = c_1 \exp\left(-\tfrac{1}{2}(x - m)^T C^{-1} (x - m)\right) + c_2\, p_{\mathrm{outlier}}, \qquad (5)$$

where $p_{\mathrm{outlier}}$ is the expected ratio of outliers.

Fig. 1. Some probability density functions which are interpreted as generative processes
for the points of the point cloud. (a) Point-To-Point. (b) Point-To-Line. (c) Point-To-Line,
orthogonal to the line of (b) with larger variance. (d) Product of (b) and (c): A
bivariate Normal Distribution that is determined by a point and a covariance matrix.



The constants $c_1$ and $c_2$ can be determined by requiring that the probability
mass of p must equal one in a finite region (for example one by one meter). As
shown above, the use of a bivariate Normal Distribution allows the modelling of
points, lines and anything in between, including the expected variances. The
influence of data points is therefore weighted in a sound way. On the other hand,
the negative log-likelihood of this mixture fulfils the requirements for robust
functions used in M-estimators: it grows subquadratically and the influence of
outliers vanishes at some point. Figure 2 illustrates these claims. Together, this
mixture leads to an accurate and robust energy function.
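The robustness argument can be checked numerically in one dimension. The following is a minimal sketch, not the authors' implementation: the constants `c1 = 0.7` and `c2 = 0.3` and the function names are assumptions chosen for illustration, and the influence function is obtained by a simple numerical derivative.

```python
import math

def gaussian_nll(x, sigma=1.0):
    """Negative log-likelihood of a zero-mean Gaussian (additive constant dropped)."""
    return 0.5 * (x / sigma) ** 2

def mixture_nll(x, sigma=1.0, c1=0.7, c2=0.3):
    """Negative log of c1 * exp(-x^2 / (2 sigma^2)) + c2 (illustrative constants)."""
    return -math.log(c1 * math.exp(-0.5 * (x / sigma) ** 2) + c2)

def influence(nll, x, eps=1e-5):
    """Influence function: numerical derivative of the negative log-likelihood."""
    return (nll(x + eps) - nll(x - eps)) / (2.0 * eps)

if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0, 5.0, 10.0):
        print(x, influence(gaussian_nll, x), influence(mixture_nll, x))
```

For the Gaussian the influence grows linearly with the residual, while for the mixture it rises for small residuals and then decays towards zero, which is the behaviour required of a robust function.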




Fig. 2. Comparison between Gaussian generative process (red dashed) and a mixture 
process with outlier model (blue solid). Likelihood (a), negative log-likelihood (b) and 
influence function (c) which is the derivative of (b). For parameter estimation the 
negative log-likelihood of the generative process is used to build the function that 
is to be minimized. The influence function characterizes the bias that a particular 
measurement has on the solution [3]. For the Gaussian the influence of outliers grows
without bounds.

4 Registration

To perform the registration, generative processes are first determined from
the model’s points in a preprocessing step. For this purpose the 2D plane is
subdivided into a regular grid. Each grid point gets a pdf assigned (the
mixture, see eq. 5) that locally represents the structure of the model point cloud
around the grid point. For each grid point the parameters of a bivariate Normal
Distribution are determined by taking into account all the points of the model
cloud that lie within a certain range around it. These parameters are simply the
mean and the covariance matrix of these points. The expected outlier ratio is
set to a constant (we use 0.3). If the structure around a grid point cannot be
approximated well, the density function becomes rather uniform (large variances
in both principal directions). If it can be approximated well, it provides
strong constraints through a small variance in at least one principal direction.
At the same time it remains robust through the outlier term. Figure 3 shows
some examples.
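The preprocessing step for a single grid point can be sketched as follows. This is an illustration under assumptions: the function name `local_gaussian`, the square neighbourhood, the half-width of 0.5 m and the minimum of three points are choices made for the example and are not taken from the paper.

```python
def local_gaussian(points, node, radius=0.5):
    """Mean and 2x2 covariance of the model points inside a square
    neighbourhood (half-width `radius`) around a grid node."""
    near = [(x, y) for x, y in points
            if abs(x - node[0]) <= radius and abs(y - node[1]) <= radius]
    n = len(near)
    if n < 3:
        return None  # too few points for a meaningful covariance (assumed cut-off)
    mx = sum(p[0] for p in near) / n
    my = sum(p[1] for p in near) / n
    sxx = sum((p[0] - mx) ** 2 for p in near) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in near) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in near) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

# Points sampled along a horizontal wall at y = 1.0
wall = [(0.1 * i, 1.0) for i in range(11)]
mean, cov = local_gaussian(wall, node=(0.5, 1.0))
```

For the wall example the variance along y is zero and the variance along x is large, so the resulting distribution constrains a data point only perpendicular to the wall, which corresponds to the Point-To-Line case.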




Fig. 3. Bivariate Normal Distributions calculated around several example grid points 
in real laser range scans. The cells shown here have a dimension of one by one meter. 
One such distribution is determined by the covariance matrix and the mean of the 
contained points. 



Now the preprocessing is finished and the iterative registration can begin.
Establishing the correspondence between a point of the data point cloud and a
pdf is now a simple lookup in the regular grid (which is possible in constant
time, in contrast to finding a closest point). As a point of the data point cloud
typically does not fall onto a grid point, the pdfs of the four closest grid
points are assigned to it. This is handled by adding up the four respective
log-likelihood terms. These are weighted bilinearly, which ensures a continuous
log-likelihood function. Summing up the log-likelihood terms of all data points
results in an energy function which is to be optimized with respect to the
transformation’s parameters.
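The lookup and the bilinear weighting can be sketched as follows. The names `bilinear_weights` and `point_energy`, the dictionary `node_nll` and the use of negative log-likelihood terms (so that the energy is minimized) are assumptions for illustration; the grid spacing of 0.5 m matches the value reported in the experiments.

```python
import math

def bilinear_weights(x, y, spacing=0.5):
    """Return the four grid nodes surrounding (x, y) with their bilinear weights."""
    i, j = math.floor(x / spacing), math.floor(y / spacing)
    fx, fy = x / spacing - i, y / spacing - j
    return [((i, j), (1 - fx) * (1 - fy)),
            ((i + 1, j), fx * (1 - fy)),
            ((i, j + 1), (1 - fx) * fy),
            ((i + 1, j + 1), fx * fy)]

def point_energy(x, y, node_nll, spacing=0.5):
    """Weighted sum of the four nodes' negative log-likelihood terms.
    `node_nll[(i, j)]` is a callable (x, y) -> negative log-likelihood."""
    return sum(w * node_nll[node](x, y)
               for node, w in bilinear_weights(x, y, spacing))
```

The weights are non-negative and sum to one, and a point lying exactly on a grid node receives weight one for that node, so the energy changes continuously when a point crosses a cell boundary.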

5 A Computationally Advantageous Approximation for
Optimization

The summands of the energy function to be optimized consist of terms of the form
$\log\left(c_1 \exp\left(-\tfrac{1}{2}(x - m)^T C^{-1} (x - m)\right) + c_2\right)$.
These have no simple first and second derivatives. This section presents an
approximation that allows a cheap analytical computation of gradient and Hessian
(needed by Newton’s algorithm). A look at Fig. 2(b) suggests that a robust
log-likelihood function of the form

$$p(x) = -\log\left(c_1 e^{-\frac{x^2}{2\sigma^2}} + c_2\right)$$

could be approximated by a Gaussian:

$$\tilde{p}(x) = d_1 e^{-d_2 \frac{x^2}{2\sigma^2}} + d_3 .$$

The parameters $d_i$ are fitted by requiring that $\tilde{p}(x)$ should behave like
$p(x)$ for $x = 0$ and $x \to \infty$, and additionally $\tilde{p}(\sigma) = p(\sigma)$
(in the bivariate case the function’s values are required to be equal at the
one-sigma contour). The derivatives of this approximation have an extremely
simple form and can be calculated cheaply. The main computational effort is the
evaluation of only one exponential function per data point to calculate both
gradient and Hessian.
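The three fitting conditions determine the parameters in closed form. The sketch below assumes the one-dimensional form $d_1 \exp(-d_2 x^2 / (2\sigma^2)) + d_3$ for the approximation; the function names and the constants used in the test are illustrative and not taken from the paper.

```python
import math

def fit_gaussian_approximation(c1, c2):
    """Fit d1, d2, d3 so that d1 * exp(-d2 * x^2 / (2 sigma^2)) + d3 matches
    -log(c1 * exp(-x^2 / (2 sigma^2)) + c2) at x = 0, x = sigma and x -> infinity."""
    d3 = -math.log(c2)                           # limit for x -> infinity
    d1 = -math.log(c1 + c2) - d3                 # value at x = 0
    p_sigma = -math.log(c1 * math.exp(-0.5) + c2)
    d2 = -2.0 * math.log((p_sigma - d3) / d1)    # value at x = sigma
    return d1, d2, d3

def robust(x, c1, c2, sigma=1.0):
    """The exact robust negative log-likelihood."""
    return -math.log(c1 * math.exp(-0.5 * (x / sigma) ** 2) + c2)

def approx(x, d, sigma=1.0):
    """The Gaussian-shaped approximation."""
    d1, d2, d3 = d
    return d1 * math.exp(-0.5 * d2 * (x / sigma) ** 2) + d3

def approx_derivatives(x, d, sigma=1.0):
    """Gradient and second derivative, sharing a single exponential."""
    d1, d2, _ = d
    e = math.exp(-0.5 * d2 * (x / sigma) ** 2)
    grad = -d1 * d2 * x / sigma ** 2 * e
    hess = d1 * d2 / sigma ** 2 * e * (d2 * x ** 2 / sigma ** 2 - 1.0)
    return grad, hess
```

Both derivatives reuse the same exponential, which reflects the remark that one exponential per data point suffices for gradient and Hessian.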

6 Some Experimental Results 

First we present some examples. Fig. 4 shows the log-likelihood functions gener- 
ated by several typical laser range scans with bright values meaning high prob- 
abilities, approximated by the method of the last section. Here, black values do 
not stand for zero probability but for the logarithm of the expected outlier prob- 
ability! The typical scan match time (including calculation of the log-likelihood 
functions’ parameters) is under 10 ms if the initial guess by odometry is taken 
into account. The initial error is then only around a few centimeters and some 
degrees in rotation. Newton’s algorithm converges in the majority of cases in two 
to five iterations. The distance between grid points is 50 cm and an environment 
of one by one meter is used to calculate means and covariance matrices. 





Fig. 4. Some example log-likelihood functions around grid points. 



We use the scan matcher as the basis for building maps using a mobile robot 
equipped with a laser scanner as data acquisition platform. Our approach here 
belongs to a family of techniques where the environment is represented by a 
graph of spatial relations obtained by scan matching [14, 11, 10]. A spatial 
relation consists of the parameters estimated by our technique and a measure 
of uncertainty. This measure is provided by the Hessian of the energy function 
at the optimum and is used to distribute errors accordingly through the graph. 
Details and more experimental results can be found in [2]. Figure 5 gives an
impression of the accuracy of our method. The data set there consists of 600
scans taken in a large (around 50 by 60 meters) environment with two long
loops to be closed.



Fig. 5. A map built using our registration technique with data acquired from a 2D 
laser range scanner, visually demonstrating the accuracy of our results. Range scan 
data courtesy of T. Duckett, University of Orebro. 



7 Conclusion 

This paper presented a probabilistic framework for the registration of point 
clouds. The main contributions were: 

1. The concept of explicitly pairing points to probability distributions. 

2. Using a mixture of a bivariate Normal Distribution and an outlier term to 
model local structure of points. 

We further proposed a computationally advantageous approximation that allows
simple calculation of gradient and Hessian. In our approach, each “magic” number
has a clearly defined meaning thanks to the probabilistic framework.

We applied our technique to the SLAM problem in robotics with excellent
experimental results on various data sets. Some of the data sets consisted of
several tens of thousands of scans and we take the successful processing of these
sets as experimental evidence for the claimed robustness.

Future work will focus on techniques for robust estimation of the probability
functions in the regular grid and on higher order models from which to derive
probability functions. Another challenge is the design of mixtures with components
that adapt over time, as has been done in computer vision applications for tracking
[12], perhaps also integrating the simultaneous tracking of moving objects.

Abstract. This paper describes audio-visual speech recognition experiments
on a multi-speaker, large vocabulary corpus using the Janus speech
recognition toolkit. We describe a complete audio-visual speech recognition
system and present experiments on this corpus. By using visual cues
as additional input to the speech recognizer, we observed good improvements,
both on clean and noisy speech in our experiments.



1 Introduction 

Visual information is complementary to acoustic information in human
speech perception, especially in noisy environments. Humans can disambiguate
an acoustically confusable phoneme using visual information because many
phonemes which are close to each other acoustically are very different from
each other visually. The connection between visual and acoustic information in
speech perception is demonstrated by the so-called McGurk effect [1]. Visual
information such as gestures, expressions, head position, eyebrows, eyes, ears,
mouth, teeth, tongue, cheeks, jaw, neck, and hair could improve the performance
of machine speech recognition [2,3]. Much research has been directed
towards developing systems that combine acoustic and visual information to
improve the accuracy of speech recognition [4, 5, 6, 7]. Many of the presented
audio-visual speech recognition systems work on a very limited domain, i.e. either
only spelled digits [8,9,10] or letters [11,12,13] are recognized, or only a small
vocabulary is addressed [14]. For large vocabulary audio-visual speech recognition,
the work by Potamianos, Neti et al. [15,16,17] has been presented. Their AVSR
system is probably the most sophisticated system today.

In this work we also target the task of large vocabulary audio-visual speech
recognition. Our approach is to use the Janus speech recognition toolkit [18,19],
which was developed in our lab, and to integrate visual speech recognition into
this system. For our experiments we use the data that was used during the
workshop on audio-visual speech recognition held at Johns Hopkins University in
2000 [15]. In the experiments we observed improvements, both on clean and noisy
speech, by using visual cues as additional input to the speech recognizer, and we
hope to achieve further improvements through enhanced preprocessing and
normalization of the data.

Fig. 1. Some pictures of the recorded faces taken from the videos.



2 Database

The data provided for our experiments consists of nearly 90 GB of video footage
with an overall duration of about 40 hours. During recording of the videos, the
speakers were placed in front of a light-colored wall and looked straight into the
camera. All videos have a resolution of 704 x 480 pixels and a frame rate of 30 Hz.
Figure 1 depicts some sample pictures from the video data.

Audio data was recorded at a sampling rate of 16 kHz in a relatively clean
audio environment. The utterances were made in a quiet office with only the
noise of some computers in the background.

The utterances are composed of a vocabulary of about 10500 words. For
training we obtained about 17000 utterances from 261 speakers with a total length
of about 35 hours. The test set is made up of 26 speakers with about 1900
utterances. These utterances have a total length of about four and a half hours.
The exact numbers are given in Table 1.



Table 1. Available audio-visual stored utterances for training and test.

Set        Utter.   Duration   Spk.
Training   17111    34.9 h     261
Test       1893     4.6 h      26


In this work, not all speakers could be used for visual recognition, because
the extraction of visual cues failed for some speakers. The biggest set we used
consisted of 120 speakers for training a stream recognizer (see Section 4.3). For
this system, 17 speakers were used for testing.

The speakers for whom the detection of the region of interest is most robust
are selected by an automatic process which considers the variation in the position
of the mouth, the variation in the width of the mouth and the number of frames
in which the facial features could not be detected at all. Only those speakers
were selected for whom the respective values are below certain thresholds.








J. Kratt et al. 



3 Visual Preprocessing 

In order to use the video images of a user’s lips for speech recognition, the
lips first have to be found and tracked in the video images. We use the program
described in [20] to find the eyes, nostrils and lip corners in the pictures.
The detected lip corners are used to locate the mouth region for visual training
and recognition. For this purpose, a square around the corners is taken such that
they lie on its left and right borders at about half of its height.

To compensate for different illumination conditions and different skin tones of
the subjects in the video images, we normalize the extracted mouth regions for
brightness. A sample image is depicted in Figure 2.




Fig. 2. The effects of the normalization of the brightness. 



As the audio processing works with 100 time slices per second, it would be
best to have a video stream with 100 frames per second, but the videos are
recorded at a rate of 30 Hz. To obtain a 100 Hz signal, each frame is repeated
three times and every third frame four times, so that three consecutive frames
yield ten time slices.
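The repetition scheme can be written down in a few lines. This is a minimal sketch; the function name and the choice of which frame in each group of three receives the fourth copy are assumptions.

```python
def upsample_30_to_100(frames):
    """Repeat each frame three times and every third frame four times,
    so that 30 input frames become 100 output frames."""
    out = []
    for index, frame in enumerate(frames):
        copies = 4 if index % 3 == 2 else 3
        out.extend([frame] * copies)
    return out

one_second = upsample_30_to_100(list(range(30)))
```

Each group of three input frames contributes 3 + 3 + 4 = 10 output frames, so one second of 30 Hz video yields exactly 100 time slices.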

Once a video stream with a frequency of 100 Hz is available, the pictures are
cosine transformed and the 64 coefficients with the highest sums over all
training frames are selected as the best coefficients. During selection of the
best 64 coefficients the first row and column are ignored because they contain
constant information which says nothing about the shape of the mouth.
Only the 64 best coefficients are used for training and recognition; they retain
nearly all information about the video signal. The results are the same when
taking all 4096 elements, but training and recognition take much longer.
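The coefficient selection can be illustrated on precomputed DCT coefficients. The sketch below assumes that the ranking uses the summed absolute value of each coefficient over all training frames; the text only speaks of the "highest summation", so this criterion and the function name `select_coefficients` are assumptions.

```python
def select_coefficients(dct_frames, keep=64):
    """Rank DCT coefficient positions by their summed magnitude over all
    training frames, ignoring the first row and the first column.
    `dct_frames` is a list of 2-D lists holding the coefficients of one frame each."""
    rows, cols = len(dct_frames[0]), len(dct_frames[0][0])
    totals = {}
    for r in range(1, rows):        # skip the first row
        for c in range(1, cols):    # skip the first column
            totals[(r, c)] = sum(abs(frame[r][c]) for frame in dct_frames)
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:keep]

# Tiny synthetic example: 4 x 4 "coefficient" frames instead of 64 x 64
frame = [[r * c for c in range(4)] for r in range(4)]
best = select_coefficients([frame, frame], keep=1)
```

The selected positions are determined once on the training data and then applied unchanged to every frame during training and recognition.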

As a next step the video signal is delayed by 60 ms to achieve better
synchronization of acoustic and visual cues. This step is performed because the
movements of the lips usually start some time before a sound is produced [12,
21].

4 Experiments 

This section describes the audio-visual speech recognition experiments we
performed. The first step in building the audio-visual speech recognizer was to
train an audio-only recognition system. This was done by using an existing speech
recognizer to label the transcriptions of the audio-visual data with exact
timestamps. Once this was done, a new speech recognizer was trained on all 261
speakers in the audio-visual data set to get an audio-only reference system.

For the training of the visual recognizer, we use only up to 120 speakers. This
was done because the visual lip-tracking module did not provide useful results
for all of the video sequences and because training of the visual recognizer was
quite time-consuming.

For the visual recognizer, we used a set of 13 visemes, as proposed by [15].
Twelve of these visemes were modeled with three states each (begin, middle, end)
and the remaining one with a single state, resulting in 12 x 3 + 1 = 37 viseme
states.

In the remainder of this section, we describe the different experiments that 
we performed. We then present and discuss the obtained results in Section 5. 



4.1 Concatenation of Feature Vectors 

As a first experiment, we simply concatenated the acoustic and visual input
features and trained a speech recognizer on the combined feature vector. The
acoustic part of the input vector consists of 13 cepstral coefficients per frame;
as visual features, 64 DCT coefficients are used. In order to provide context to
the recognizer, the five frames before and after the current one are appended to
the feature vector, which results in a feature vector with (13 + 64) x 11 = 847
elements. To reduce the dimensionality of the feature vector, a linear
discriminant analysis (LDA) is calculated. The resulting 42 most significant
coefficients are then used as input feature vector for the recognizer.
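The size of the stacked vector follows directly from the numbers above. The sketch below builds such a vector; the function name `stack_context` and the clamping at the start and end of an utterance are assumptions, since the text does not say how sequence boundaries are handled.

```python
ACOUSTIC, VISUAL, CONTEXT = 13, 64, 5

def stack_context(frames, t, context=CONTEXT):
    """Concatenate the frames t-context .. t+context into one vector.
    Indices outside the sequence are clamped to the first or last frame
    (an assumption made for this example)."""
    last = len(frames) - 1
    stacked = []
    for k in range(t - context, t + context + 1):
        stacked.extend(frames[min(max(k, 0), last)])
    return stacked

# 20 dummy frames, each holding 13 acoustic and 64 visual features
frames = [[0.0] * (ACOUSTIC + VISUAL) for _ in range(20)]
vector = stack_context(frames, t=10)
```

With 77 features per frame and 11 frames in the window, the stacked vector has 847 elements before the LDA reduces it to 42.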

By concatenating the visual and acoustic input features, an acoustic speech
recognition system can easily be adapted to perform audio-visual recognition
with few changes. This approach, however, has several drawbacks. First, since
the feature vector becomes large, training of the system becomes computationally
expensive. A more severe disadvantage is that the contributions of audio and
video data to the recognition process become unbalanced, since more features are
used for the visual input than for the acoustic input. Thus, important
information in the audio data might get lost by performing LDA.



4.2 Reducing the Feature Space 

As the first approach is not very flexible in changing the weights of the video
and audio data, a more flexible approach is needed to combine acoustic and
visual cues. In [15] a hierarchical LDA approach (HiLDA) to reduce the
audio-visual feature space was suggested. In this approach, LDA is performed on
the visual and acoustic input vectors separately. The resulting reduced vectors
are then combined, and LDA is performed again on the combined audio-visual
feature vector.

This procedure has two advantages: First, the computational load is reduced.
Three matrix multiplications are now needed for calculating the LDA
transformation instead of one, but the matrices are much smaller. As the number
of operations for a matrix multiplication grows as O(n^3), the overall number of
operations decreases. Second, this approach allows for better adjustment of the
weights of the different modalities. We obtained good results when first
reducing the visual feature vector to only 10 coefficients and the acoustic
vector to 90 coefficients. During the second LDA step a reduction to 42
coefficients is performed.

4.3 Stream Recognizer 

While the hierarchical LDA approach gives much more flexibility than a simple
concatenation, it still has some disadvantages that can be overcome by a stream
recognizer which processes acoustic and visual features independently. To build
such a system, a separate classifier is trained to compute likelihoods for each
of the input streams. The results are then combined at a later stage. This
system has proven to give the best results. For the combination, the possible
hypotheses are scored for each stream and the scores are then combined according
to the given stream weights. For our system the best results were achieved when
the audio stream is weighted by 70%.
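The late combination of the two streams can be sketched as a weighted sum of per-hypothesis log-likelihoods. This is an assumption about the exact combination rule, which the text does not spell out; the function name, the hypotheses and the score values below are made up for illustration.

```python
def combine_streams(audio_scores, video_scores, audio_weight=0.7):
    """Combine per-hypothesis log-likelihoods of both streams by a weighted sum."""
    video_weight = 1.0 - audio_weight
    return {hyp: audio_weight * audio_scores[hyp] + video_weight * video_scores[hyp]
            for hyp in audio_scores}

# Made-up log-likelihoods for two acoustically similar hypotheses
audio = {"bat": -12.0, "pat": -11.5}
video = {"bat": -4.0, "pat": -9.0}

combined = combine_streams(audio, video)
best = max(combined, key=combined.get)
```

In this toy example the audio stream alone slightly prefers "pat", while the weighted combination selects "bat" because the visual stream separates the two hypotheses clearly. Changing `audio_weight` only affects this final step.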

As the weights for audio and video data are not applied before the recognition
process, it is not necessary to retrain a stream recognizer when the weights
change. This can save a lot of time when testing different scenarios, because
only the test must be computed for each one. In the two approaches described
before, the training must be repeated for each test.

Another advantage of this additional flexibility is the possibility of
automatically adapting the recognizer to a given environment, e.g. by measuring
the signal-to-noise ratio of the audio signal.

5 Results 

We now present the results of the different audio-visual speech recognizers.
As the stream recognizer provides the best results, the most detailed results
are available for this system. Tests on small subsets show the advantage of the
HiLDA and stream recognizers over simple feature concatenation. First
we trained three audio-visual recognition systems and an audio-only system
with 14 speakers. Testing was done on five separate speakers. As can be seen
in Table 2, the audio-only system performs best in this case, followed by the
HiLDA approach and the stream recognizer. The concatenation is the worst of
the tested scenarios.

The poor audio-visual recognition results in this case are likely due to the
small amount of training data. As there is a high variability in the video, more
training data is needed.

In our second experiment, we therefore trained the systems with 30 speakers
(see Table 2). Now audio-visual recognition with the HiLDA and stream
recognizers outperforms pure acoustic recognition, even on clean audio data.
Testing was again done on five subjects.

In our last experiment we trained an audio-visual stream recognition system
with a much bigger training set. Since the acoustic and visual parts of the stream
recognizer can be trained independently, different amounts of training data can
be used for each modality. To train the acoustic part of the recognizer, we used
all 261 speakers. For the visual part, we used only those 120 speakers for whom
the automatic tracking of the lips performed best. Testing was done on 17
subjects.

Table 2. Audio-visual word error rates for a system trained on clean audio data;
five speakers are used for the test set.

          14 speakers   30 speakers
audio     48.28%        39.29%
concat    51.36%        40.11%
HiLDA     49.16%        38.48%
stream    49.28%        38.24%

Table 3. Audio-visual word error rates for a system trained on 120 speakers for
the video part and all 261 speakers for the audio part.

stream weights
audio:video          clean audio   noisy audio
100:0 (audio-only)   25.26%        53.94%
90:10                24.78%        51.19%
80:20                24.30%        48.53%
70:30                24.10%        47.37%
60:40                24.29%        48.05%
50:50                25.52%        52.72%

Table 3 shows the recognition results depending on the stream weights. It
can be seen that weighting the acoustic stream by 70% led to the best recognition
results, both on clean and on noisy audio. On clean speech, the WER of the
audio-visual system is about 1% absolute lower than that of the audio-only system
(about 5% relative improvement). In the case of noisy audio, the WER decreases by
about 12% relative: it drops from 53.94% to 47.37% absolute.
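The relative improvements can be recomputed from the word error rates in Table 3. The helper name below is chosen for this example.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Audio-only (100:0) versus the best stream weights (70:30) from Table 3
clean = relative_improvement(25.26, 24.10)
noisy = relative_improvement(53.94, 47.37)
```

With these figures the relative reduction is roughly 4.6% on clean audio and roughly 12.2% on noisy audio.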

The noise level was selected to obtain a WER of similar magnitude as in [15].
Figure 3 shows how the recognition results change for different SNR values. With
rising noise levels, the improvement of audio-visual over audio-only recognition
increases.




Fig. 3. Plot of word error rates for different SNR values.

6 Conclusion and Future Results

In this paper we have described how audio-visual speech recognition can be
done with the Janus speech recognition toolkit, an HMM-based state-of-the-art
speech recognizer. Experiments were performed on a large vocabulary
speaker-independent continuous speech recognition task. We obtained good
experimental results by training a stream recognizer, which first computes
log-likelihoods for each of the input modalities and then combines these
hypotheses using the stream weights. With this approach, relative improvements
on both clean and noisy speech were obtained. The improvement in relative WER
achieved so far is about half of the improvement reported in [15]. We think that
this is mainly due to the fact that we could only use a fraction of the data used
in [15].

The presented system provides a good basis for further audio-visual speech
recognition research. We are now working on improving the facial feature
tracking approach in order to be able to use more speakers from the
database. In fact, we are already able to use 200 speakers instead of the 120
used for the presented experiments.

Among the first things that we plan to improve is the visual preprocessing 
of the data. So far, only histogram normalization is done. Since we observed 
that some subjects tilted their heads quite significantly, we hope to improve the 
recognition results by appropriate rotation of the input images in the future. 
Adaptive adjustment of the combination weights for the input modalities should
also improve the recognition results in the future.

Abstract. The goal of image registration is to find a transformation
that aligns one image to another. In this paper we present a novel automatic
image registration approach for images with structural distortions
(e.g. a lesion within a human brain). The main idea is to define
a suitable matching energy, which effectively measures the similarity
between the images. The minimization of the matching energy is an ill-posed
problem. Hence, we add a regularity energy borrowed from linear
elasticity theory, which incorporates smoothness constraints into the
displacement. The resulting energy functional is minimized by a
Levenberg-Marquardt iteration scheme. Finally, we give a two-dimensional example
of these applications.



1 Introduction 

An important problem in two- and three-dimensional medical image analysis is
to match two similar images, resulting from the same or from different imaging
modalities. Especially in brain research the development of fast deformable
image registration algorithms has been an active topic of research in recent
years. Here, a typical approach is the minimization of a suitable distance
functional. Minimization strategies currently used deal with the Navier-Lamé
equilibrium equations for linear elasticity, given by a partial differential
equation (PDE). Here, the external forces, given by the derivatives of the
distance functional, are applied to the template image. The template image is
deformed until an equilibrium state (described by the PDE) between the external
forces and the internal forces resisting the deformation is achieved. The
resulting displacement field u satisfies the PDE with external forces f, see,
e.g. [1,2,6,10,16].

Driven by ever more powerful computers, these algorithms have become important
tools, e.g. in guidance of surgery, diagnostics, quantitative analysis of
brain structures (interhemispheric, interareal and interindividual), ontogenetic
differences between cortical areas, and interindividual brain studies. Although
these techniques have been applied very successfully for both the uni- and the
multimodal case (e.g. see [1,2,6,7,8,9,10,12,16,17,20]), they may be
less appropriate for studies using brain-damaged subjects, since there is no
compensation for the structural distortion introduced by a lesion (e.g. a tumor,
ventricular enlargement, large regions of atypical pixel intensity values, etc.).

Generally the computed solution cannot be trusted in the area of a lesion. The 
magnitude of the effect on the solution depends on the character of the registra- 
tion scheme employed. It is not only that these effects are undesirable, but also 
that in some cases one is especially interested in where the lesion would be in 
the other image. If, for instance, we want to know which function of the brain 
is usually performed by the damaged area, we could register the lesioned brain 
to an atlas and map the lesion to functional data within the reference space. 

In more general terms the problem can be phrased as follows. Given are a de- 
formable template image T and a reference image R as well as a domain G 
including a segmentation of the lesions. The aim of the proposed image registra- 
tion algorithm is to find a “smooth” displacement-field u, which: 

Minimizes a given similarity functional between T and R 
under the condition that: 

The lesion G is conserved in the transformed template image T . 

The main idea is to define a suitable distance functional, which effectively
measures the similarity between the images. The presented approach can be seen as
the well known “image inpainting approach” (e.g. see [3,4,5]) for the unknown
displacement field u, see [13]. The minimization of the presented matching energy
is an ill-posed problem (see [16]). Hence, we investigate a Levenberg-Marquardt
scheme for minimization of the novel distance functional. Here, we first linearize
the least squares functional. The linearized functional is minimized within a
so-called trust region around the current solution. The trust region is
quantified by a metric, which measures the elastic energy of the displacement.



2 A Lesion Preserving Image Registration Algorithm 



2.1 A Lesion Preserving Similarity Functional 



In the situation that the intensities of the given images are comparable, a proper
choice for a distance functional is the so-called sum of squared differences
between the images

$$D(u) = \frac{1}{2} \int_\Omega \left( T(x_1 - u_1(x), \ldots, x_d - u_d(x)) - R(x_1, \ldots, x_d) \right)^2 dx.$$

This is a common criterion. It is used, for example, in the case that the images
are recorded with the same imaging machinery, the so-called mono-modal image
registration. Due to the absence of information in the domain G, we define a
lesion mask by

$$\lambda_G(x) = \begin{cases} 1 & \text{if } x \in \Omega \setminus G, \\ 0 & \text{if } x \in G \end{cases}$$

and consider the following similarity functional

$$D_G(u) = \frac{1}{2} \int_{\Omega \setminus G} \left( T(x_1 - u_1(x), \ldots, x_d - u_d(x)) - R(x_1, \ldots, x_d) \right)^2 dx
= \frac{1}{2} \int_\Omega \lambda_G(x) \left( T(x_1 - u_1(x), \ldots, x_d - u_d(x)) - R(x_1, \ldots, x_d) \right)^2 dx.$$
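On a pixel grid the masked functional becomes a sum over all pixels outside the lesion. The sketch below assumes that the template has already been deformed by the current displacement, that the pixel area is one, and that the lesion is given as a boolean image; the function name is chosen for this example.

```python
def masked_ssd(template, reference, lesion):
    """0.5 * sum of (T - R)^2 over all pixels outside the lesion G.
    `lesion[i][j]` is True inside G, where the mask is 0; elsewhere the mask is 1."""
    total = 0.0
    for t_row, r_row, g_row in zip(template, reference, lesion):
        for t, r, in_lesion in zip(t_row, r_row, g_row):
            mask = 0.0 if in_lesion else 1.0
            total += mask * (t - r) ** 2
    return 0.5 * total

template = [[1.0, 2.0], [3.0, 9.0]]    # the last pixel carries a "lesion" intensity
reference = [[1.0, 2.0], [3.0, 4.0]]
lesion = [[False, False], [False, True]]
no_lesion = [[False, False], [False, False]]
```

With the mask the lesion pixel does not contribute to the distance, whereas the unmasked sum of squared differences is dominated by it.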



2.2 A Levenberg-Marquardt Iteration for Minimizing the Similarity 
Functional 



In order to minimize the functional $D_G(u)$ we use the Levenberg-Marquardt
iteration scheme. The Levenberg-Marquardt method is a variant of the Gauß-Newton
iteration for the minimization of $D_G$. Here, for a current approximation
$u^{(k)}(x)$ the nonlinear image difference $h(u) = T(x - u(x)) - R(x)$ is replaced
by its linearization around $u^{(k)}(x)$ within a ball of bounded radius in the
energy norm $\|\cdot\|_E$, which is defined by

$$\|v\|_E^2 = \langle v, v \rangle_E \quad \text{with inner product} \quad \langle v, w \rangle_E = \int_\Omega w^t(x) L v(x)\, dx$$

and a symmetric positive definite operator L. For our specific application we use
the following operator

$$L u(x) := -\mu \Delta u(x) - (\mu + \lambda) \nabla (\nabla \cdot u(x))$$

with the so-called Lamé constants $\lambda$ and $\mu$. Using the method of Lagrange
multipliers, this is easily seen to be equivalent to minimizing the quadratic
functional

$$Q(u) = \int_\Omega \frac{\lambda_k(x)}{2} \left( h_k + J_k u(x) \right)^2 dx + \frac{\alpha}{2} \langle L u, u \rangle, \qquad (1)$$

where $h_k := h(u^{(k)}(x)) = T(x - u^{(k)}(x)) - R(x)$,

$$J_k := J_h(u^{(k)}(x)) = \nabla T(x - u^{(k)}(x)) = \left( \frac{\partial T}{\partial x_1}(x - u^{(k)}(x)), \ldots, \frac{\partial T}{\partial x_d}(x - u^{(k)}(x)) \right)$$

and the mask

$$\lambda_k(x) = \begin{cases} 1 & \text{if } x \in \Omega \setminus G_k, \\ 0 & \text{if } x \in G_k \end{cases}$$

for the transformed subdomain $G_k = \{ y \mid y = x - u^{(k)}(x) \ \forall x \in G \}$.


2.3 The Model of a Clamped Elastic Membrane 

The second term in equation (1) can be regarded as a penalty for “elastic
stresses” resulting from the displacements of the images. This is a suitable model
in many medical applications, for example, when pressure or movement is applied
to a patient. By using Dirichlet boundary conditions the resulting displacements
may be interpreted physically as the displacement of a clamped elastic membrane.
For each iteration step we have the following result.

Theorem 1. Using Dirichlet boundary conditions for the operator L at
$\partial\Omega$, the unique minimizer $u^*(x) \in (H_0^1(\Omega))^d$ of (1) is
characterized by the following variational equation

$$\int_\Omega \varphi^t(x) \left( \lambda_k(x) J_k^t J_k + \alpha L \right) u(x)\, dx = -\int_\Omega \lambda_k(x) J_k^t h_k \varphi(x)\, dx \quad \forall \varphi \in (H_0^1(\Omega))^d. \qquad (2)$$

Proof. Note that

$$\int_\Omega \frac{\lambda_k(x)}{2} \left( h_k + J_k u(x) \right)^2 dx + \frac{\alpha}{2} \int_\Omega u^t(x) L u(x)\, dx
= \frac{1}{2} \int_\Omega \lambda_k(x) \left( 2 h_k J_k u(x) + h_k^2 \right) dx
+ \frac{1}{2} \int_\Omega u^t(x) \left( \lambda_k(x) J_k^t J_k + \alpha L \right) u(x)\, dx.$$

Since the operator L is symmetric positive definite by using Dirichlet boundary
conditions and $J_k^t J_k$ is symmetric positive semidefinite, it follows that the
bilinear form

$$B[u,v] := \int_\Omega u^t(x) \left( \lambda_k(x) J_k^t J_k + \alpha L \right) v(x)\, dx$$

is symmetric and positive definite. Consequently, the weak solution of (1) is
unique and given by the solution of (2). □

Note that by the definition of $\lambda_k(x)$ the classical solution of (2) is
given by the boundary value problem

$$(\alpha L + J_k^t J_k)\, u(x) = -J_k^t h_k \quad \text{for } x \in \Omega \setminus G_k, \qquad (3)$$

$$\alpha L\, u(x) = 0 \quad \text{for } x \in G_k, \qquad (4)$$

$$u(x) = 0 \quad \text{for } x \in \partial\Omega. \qquad (5)$$
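The structure of the boundary value problem (3)-(5) can be illustrated by a one-dimensional analogue. The sketch below is not the authors' solver: it assumes a unit grid spacing, replaces the elasticity operator by $L u = -\kappa u''$ (in one dimension $\kappa$ plays the role of $2\mu + \lambda$), uses a direct tridiagonal solve instead of multigrid, and all names are chosen for this example.

```python
def lm_step_1d(h, J, mask, alpha=1.0, kappa=1.0):
    """Solve (alpha * L + mask * J * J) u = -mask * J * h on the interior grid
    points with u = 0 at both ends, where L u = -kappa * u'' (unit spacing).
    Outside the lesion (mask = 1) this is equation (3), inside (mask = 0) it
    reduces to equation (4); the zero end values correspond to equation (5)."""
    n = len(h)
    diag = [2.0 * alpha * kappa + mask[i] * J[i] * J[i] for i in range(n)]
    off = -alpha * kappa                     # sub- and super-diagonal entries
    rhs = [-mask[i] * J[i] * h[i] for i in range(n)]
    # Thomas algorithm for the symmetric tridiagonal system
    for i in range(1, n):
        w = off / diag[i - 1]
        diag[i] -= w * off
        rhs[i] -= w * rhs[i - 1]
    u = [0.0] * n
    u[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        u[i] = (rhs[i] - off * u[i + 1]) / diag[i]
    return u

h = [0.0, 1.0, 2.0, 5.0, 5.0, 2.0, 1.0, 0.0]   # image difference, large in the lesion
J = [1.0] * 8                                   # linearization coefficients
mask = [1, 1, 1, 0, 0, 1, 1, 1]                 # 0 marks the lesion
u = lm_step_1d(h, J, mask)
```

Inside the masked region the image difference has no influence: the displacement there is determined only by the regularizer and interpolates the values at the lesion boundary.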



2.4 A Parameter Choice Rule 

An important problem is the proper choice of the parameter $\alpha$ in practical
applications. A small $\alpha$ leads to strong artifacts due to the influence of
high-frequency structures in the image data (e.g. the noise). Increasing $\alpha$
removes the artifacts and allows only smoother transformations. The result
becomes worse if the parameter increases further.

The optimal balance between the two extremes is a tough issue. In practice the
costs of tuning the parameter are high and for most methods only “trial and
error” approaches are available. In our implementation we determine $\alpha$
using a trust-region approach as presented in [12].

2.5 Discretization and Approximation of the Boundary Value 
Problem 

Fig. 1. The image information is missing on the domain G. (a) Artificial interface U for G. (b) Discretization of U and G. 

In order to discretize the operator $J_k$ in equation (3), we have to fix an overlap 
between the subdomains $G_k$ and $\Omega \setminus G_k$. Therefore we enlarge $G_k$ by one point 
parallel to the boundary of $G_k$, see figure 1. The extended domain is denoted by 
U. The operator $J_k$ is approximated by using central differences for all points 
$x \in \Omega \setminus (G_k \cup U)$. For instance, for the three-dimensional case, we have 

$$J_k \approx \frac{1}{2}\left(T^k_{i+1,j,l} - T^k_{i-1,j,l},\; T^k_{i,j+1,l} - T^k_{i,j-1,l},\; T^k_{i,j,l+1} - T^k_{i,j,l-1}\right), \qquad (6)$$

where 

$$T^k_{i,j,l} = T\left(x_i - u_1^{(k)}(x_i,x_j,x_l),\; x_j - u_2^{(k)}(x_i,x_j,x_l),\; x_l - u_3^{(k)}(x_i,x_j,x_l)\right)$$

is the template image deformed by $u^{(k)}$. In order to discretize the elliptic operator 
L, we use a finite difference approach and we approximate the partial derivatives 
by second order approximations, for details see [11]. 
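
The central-difference approximation of the operator J_k can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: unit grid spacing, a template array `T_k` that has already been deformed by the current displacement, and a hypothetical function name `central_difference_gradient`. The exclusion of the points in G_k and U is left out.

```python
import numpy as np


def central_difference_gradient(T_k):
    """Approximate the gradient of the deformed template at interior grid points.

    T_k: 3-D array holding the deformed template on a unit-spaced grid (assumption).
    Returns an array of shape (3, n1-2, n2-2, n3-2) with the components
    0.5 * (T[i+1,j,l] - T[i-1,j,l]), 0.5 * (T[i,j+1,l] - T[i,j-1,l]) and
    0.5 * (T[i,j,l+1] - T[i,j,l-1]).
    """
    T_k = np.asarray(T_k, dtype=float)
    d1 = 0.5 * (T_k[2:, 1:-1, 1:-1] - T_k[:-2, 1:-1, 1:-1])
    d2 = 0.5 * (T_k[1:-1, 2:, 1:-1] - T_k[1:-1, :-2, 1:-1])
    d3 = 0.5 * (T_k[1:-1, 1:-1, 2:] - T_k[1:-1, 1:-1, :-2])
    return np.stack([d1, d2, d3])
```

For a template that varies linearly along each axis, the sketch reproduces the slopes exactly, which is a convenient sanity check before applying it to image data.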



2.6 Fast Solution Methods 

In practice, the solution of the linear system (3)-(5) is the time-consuming part of 
the Levenberg-Marquardt iteration. For the resulting discrete system a multigrid 
Correction Scheme (CS) was used as a solver (with optimal multigrid complexity O(N) for 
N picture elements), for details see e.g. [11,15]. Since the resulting 
system is symmetric and positive definite, other solvers such as Krylov subspace 
methods [21] can be used as well. Note that the underlying operator is anisotropic, and 
consequently solvers based on the fast Fourier transform (FFT) (see [22]) cannot 
be used. 

In order to speed up the minimization process, we have implemented the Leven- 
berg-Marquardt iteration within a scale-space framework as presented in [14]. 
The Levenberg-Marquardt iteration is first performed on a coarse scale. The 
result of this scale is then propagated to a finer scale and the iteration is restarted 
there. This process is continued down to the finest scale of the underlying scale- 
space, yielding the final registration result. 

Fig. 2. In the first row the reference (a) and the template (b) are shown. Registrations without 
and with the definition of a region G have been performed. In the second row the 
lesioned template and the transformation fields for the former (c) and the latter (d) are 
displayed. The corresponding results can be found in the last row. The contour around 
the sections corresponds to the silhouette of the reference. 



3 Results 

We demonstrate the algorithm on a pair of deformed histological sections. The 
template section was lesioned in three places. Figure 2 shows the reference (a) 
and the undamaged template (b). The contour marks the silhouette of the refer- 
ence section. Two registrations have been performed: one where no region G has 
been defined and one where the region G corresponds to the lesions. Results for 
the former are displayed on the left, and results for the latter are displayed on 
the right. In the second row the lesioned template along with the transformation 
field is shown, and the last row displays the results of the registrations. Here 
we used $\mu = \lambda = 1$. 

From the transformation fields it is obvious that when no region G is defined 
the surrounding tissue is “pulled” into the lesion. With the proposed approach 
the transformation is interpolated into the regions defined by G and the lesion 
is preserved. 

4 Conclusion 

In this paper we have presented a pixel-based approach for nonlinear image reg- 
istration for images with structural distortions (e.g. a lesion within a human 
brain). The problem can be traced back to a modified functional whose mini- 
mizers represent the mapping which transforms one image into another. 

We minimize this functional with a Levenberg-Marquardt iteration 
in a few steps. The small number of iterations and the finite dimension of the 
problem act as a regularization. Although we have only shown two-dimensional 
results, the method has already been applied in artificial 3D cases [13] and in a 
patient study involving patients suffering from ischaemic lesions [18,19], using a 
Landweber scheme instead of the Levenberg-Marquardt scheme. 

Abstract. This paper presents a knowledge-based image segmentation tool for 
organ delineation in CT (Computed Tomography) images. Noise and low 
contrast make the detection difficult. Therefore, a radial search, 
a noise reduction method and a post-processing algorithm have been implemented 
to improve the quality of contour detection. Three edge detection algorithms 
are used, and after detection several optimization methods are 
employed to obtain an accurate contour from the three detected contours. Finally, to 
achieve higher detection accuracy, an active contour model (ACM), or snake, is 
applied to the contour detected by the previous methods. 



1 Introduction 

Radiotherapy is one of the most common cancer treatment techniques. It makes use of 
radiation beams to eradicate cancerous tissues. To plan radiotherapy treatment [1, 2], 
the location and the shape of the regions of interest (ROI) need to be identified. In many 
hospitals in the U.K., the delineation or outlining process is performed manually on a 
number of CT (Computed Tomography) images. Such a process is prone to 
inter-observer errors and lacks consistency [3]. To improve the 
consistency of the delineation process, semi-automatic delineation tools with 
computer-aided processing have been developed, see [2-4]. Computers can provide 
more consistency and repeatability. 

The aim of this work is to develop an accurate and efficient delineation tool for 
contour detection in CT images. Due to the low contrast between tissues and organs, 
and noise in CT images [3, 5], automatic outlining is still difficult. To overcome these 
difficulties, prior knowledge has been incorporated and more than one algorithm has been 
used to improve the accuracy of the delineation. 

The noise and contrast problems restrict the performance of edge detection 
algorithms, so noise reduction methods have been used to improve the quality of the 
image [6]. Edge detection algorithms are based on different principles. Hence, each 
algorithm may give a different contour for the same region in the same image. This 
has been shown in Vickers's work [5]. Once the edge pixels have been found by 
those algorithms, a decision needs to be made to select the most appropriate contour. 
Genetic algorithms and dynamic programming [7, 8] have been used in image 
processing; they are, however, time consuming [9]. In this work, the level-2 refinement 
method has been applied to reduce the processing time. 

Finally, an active contour model has been used to increase the accuracy of contour 
detection in this work. The Active Contour Model (ACM), first presented by Kass et 
al. [10], has been widely used for medical image segmentation recently. It differs from 
traditional low-level edge detection methods in that high-level knowledge is 
involved to find the appropriate local minimum. This deformable model developed 
by Kass is also known as the 'snake'. It is based on energy minimization: the snake 
minimizes an energy function and moves the contour dynamically towards the true border of the ROI. 

The traditional snake needs an initial position as close as possible to the 
proper edge of the object; otherwise it will get lost during convergence. This 
problem has been pointed out and addressed by Cohen [11], who developed a balloon force 
for snake convergence. The snake still has problems detecting concave 
objects, so Xu et al. developed the GVF snake [12] to solve this problem. 

The snake is a time-consuming method for contour delineation. Cohen increased the 
capture range of the snake, but when the initial snake is far away from the true border 
of the ROI, the processing time becomes another bottleneck of radiotherapy treatment 
planning. In this paper, the initial position of the snake provided by the radial search 
method is close enough to the true contour, so the initialization problem of the traditional 
snake is solved and the processing time of detection is also reduced. 

This paper describes the delineation tools developed to automatically outline ROIs 
in the pelvic area. Section 2 presents the image processing strategy developed, which 
combines radial search with pre- and post-processing, and then explains the contour pixel 
analysis, which focuses on edge pixel selection and contour refinement. 
Section 3 describes the implementation of the active contour model in this application. 
Section 4 presents experimental results. Section 5 concludes this work. 



2 Automatic Delineation 

The radial search method (RSM) is widely used in contour detection because of its 
simplicity and scalability. Han [13] uses the radial search method combined with 
thresholding, and Ruiz [14] uses the radial search method with LoG (Laplacian of Gaussian). To improve the 
quality of processing, they use pre-processing to suppress the noise in images. 
Because each edge detection method has its own limitations, relying on the single edge detection 
method used in their applications can lead to misdetection: LoG is sensitive to noise 
pixels and thresholding performs poorly in low contrast images. To avoid the limitations of 
the individual methods, more than one edge detection method is applied in this work. This 
delineation tool combines pre-processing, three edge detection methods, and post- 
processing. Radial search methods have a number of limitations and fail to outline 
some concave objects. In this work, however, attention is focused on outlining the 
bladder, which is mostly convex. To improve the accuracy of the detection techniques 
presented in [15], a number of modifications have been made and are described 
below. 



2.1 Structure of Radial Searching Method 

The main structure of this delineation tool is: 

• Pre-processing for noise reduction; 

• Three edge detection algorithms, combined with the Bresenham line drawing algorithm, to find 
edge pixels in the ROI; 

• Post-processing for contour refinement. 

The pre-processing uses a median filter for noise reduction. In [15], three pre- 
processing techniques were investigated, namely median filtering, mean filtering and 
limit filtering, and it was found that the median filter gives the best performance when 
associated with the edge detection algorithms used. The three edge detection algorithms 
are thresholding, mean value ± 3 standard deviations, and mean value ± 3 standard 
deviations within a fixed window of size W, which are named C1, C2, and C3. The 
details of the radial search are discussed in [15]. 



2.2 Level 1 Post-processing 

Global noise reduction is performed by the median filter, which removes some noise 
pixels in the CT image. After edge detection, there are still some noise pixels on the 
contour, which differ considerably in location from their neighboring pixels and make the 
contour of the ROI discontinuous. Based on the local features of edge pixels, noise pixels 
on the contour are adjusted by considering the features of neighboring pixels. 

To avoid misdetection on the radius: 

• Once a potential edge pixel $x_k$ has been detected, the following N pixels, from 
$x_{k+1}$ to $x_{k+N}$, are checked as shown in Figure 1. 

• If M% of them (M is defined by the user; 60% is used in this work) have the same 
properties as $x_k$, $x_k$ is an edge pixel; if not, search for another potential edge 
pixel on the radius and repeat the process. 
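
The confirmation rule above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the boolean encoding `is_edge_like` (True where a pixel shares the property that made it a candidate), the function names, and the choice of "at least M%" as the acceptance test are assumptions.

```python
def confirm_edge_pixel(is_edge_like, k, n, m_percent=60.0):
    """Return True if the candidate at index k is confirmed as an edge pixel.

    is_edge_like: list of booleans along one radius (hypothetical encoding).
    k:            index of the candidate pixel x_k.
    n:            number N of following pixels x_{k+1} .. x_{k+N} to check.
    m_percent:    required percentage M of agreeing pixels (60 in this work).
    """
    following = is_edge_like[k + 1:k + 1 + n]
    if n <= 0 or len(following) < n:   # not enough pixels left on the radius
        return False
    agreeing = sum(1 for flag in following if flag)
    return 100.0 * agreeing / n >= m_percent


def find_edge_on_radius(is_edge_like, n, m_percent=60.0):
    """Scan outward and return the index of the first confirmed edge pixel, or None."""
    for k, flag in enumerate(is_edge_like):
        if flag and confirm_edge_pixel(is_edge_like, k, n, m_percent):
            return k
    return None
```

An isolated noise pixel is rejected because too few of its N successors agree with it, and the scan simply continues along the radius.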




Fig. 1. Judging the edge pixel on the radius 

In pre-processing, global noise reduction has been employed. Since it cannot 
remove all noise pixels on the contour, the median filter is used again for 
smoothing. Instead of removing the noise on a radius based on the gray level of pixels, 
it works on the length of the radii to adjust the pixel locations on the contour. 
During the processing, radii with a significant length difference are replaced by 
the median value of the neighboring radii. 



2.3 Level 2 Post-processing: Contour Pixels Analysis 

In this work, three different algorithms have been used for edge detection. Each of 
them works independently, so after detection three different sets of contour pixels are 
available. It is necessary to select the most precise contour points from the three sets of pixels. 

2.3.1 Edge Pixel Selection 

To choose the 'best' edge pixel on the radius, two methods have been investigated. 

Selection 1: median value selection 

The edge pixel is chosen according to the length of the 
radii defined by the three potential edge pixels. In this selection, the pixel whose radius 
is the median value of the three radii is considered the best choice for the edge 
pixel on the contour. 

• At angle θ, the three edge pixels detected on a ray have radii R1, R2, 
and R3 (here R1 is the radius from C1, R2 is from C2, and R3 is from C3); 

• Sort (R1, R2, R3) to get their median value R; 

• According to R, the best edge pixel of the three can be found. 

Selection 2: voting and mean value 

This selection is based on voting. It assumes that the edge pixel should appear at the 
most likely place, which means that the edge pixel should be close to the two of the three detected 
pixels that are closest to each other. For example, at angle θ three edge pixels are detected 
on a ray, with radii R1, R2, and R3. The selection is made as 
follows: 

Case 1: If |R1-R2| < |R1-R3| and |R1-R2| < |R2-R3|, $R_v$ = (R1+R2)/2; 

Case 2: If |R1-R3| < |R1-R2| and |R1-R3| < |R2-R3|, $R_v$ = (R1+R3)/2; 

Case 3: If |R3-R2| < |R3-R1| and |R3-R2| < |R2-R1|, $R_v$ = (R3+R2)/2. 

Here $R_v$ is the average value of the radius. 
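
Both selection rules are small enough to sketch directly. The function names are illustrative, and the tie handling in the voting rule (the first closest pair wins) is an assumption, since the text only states the three strict cases.

```python
def select_median_radius(r1, r2, r3):
    """Selection 1: keep the radius that is the median of the three candidates."""
    return sorted((r1, r2, r3))[1]


def select_voted_radius(r1, r2, r3):
    """Selection 2: average the two radii that lie closest to each other.

    Ties are resolved in favour of the first pair tested (assumption).
    """
    pairs = [(r1, r2), (r1, r3), (r2, r3)]
    a, b = min(pairs, key=lambda p: abs(p[0] - p[1]))
    return (a + b) / 2.0
```

With radii 40, 55 and 42, the median rule keeps 42, while the voting rule averages the two agreeing detections 40 and 42 and discards the outlier 55.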

2.3.2 Contour Refinement 

Due to the low contrast and noise in CT images, the edge detection algorithms cannot 
find edges that are heavily blurred or surrounded by a large number of noise pixels. The 
result is gaps or spikes on the contour, which 
cause the biggest error in contour detection. To avoid or reduce this error, further 
refinements are necessary. Here two methods are investigated for contour refinement. 

Refinement 1: discontinuity removing 

This method detects discontinuities on the contour. 

1. Set up a threshold $T_R$ for the difference of neighboring radii. 

2. If |R(i+1)-R(i)| > $T_R$, then record i; 

3. If |R(j+1)-R(j)| > $T_R$, then record j; 

4. If R(j) and R(i) are neighbors, i.e. j-i=1, just pull the pixel back, R(j)=R(i); 

5. If R(j) and R(i) are not neighbors, then the values between R(j) and R(i) are 
calculated by linear interpolation to fill the gap. 

This operation starts from the first pixel on the contour and also needs to compare the last 
pixel with the first one; the gaps or spikes on the contour are then removed. The 
threshold $T_R$ is obtained from the manually outlined contour by comparing the 
differences of neighboring pixels' radii. The largest difference is selected as the threshold. 
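
One possible reading of the discontinuity removal is sketched below. Several details are assumptions made for the sake of a runnable example: the outlier run is taken to be the radii i+1 .. j between the two recorded jumps, longer runs are interpolated between R(i) and R(j+1), and the wrap-around comparison of the last and the first radius is omitted for brevity.

```python
def remove_discontinuities(radii, t_r):
    """Return a copy of radii with spikes and gaps between paired jumps repaired.

    A jump is a place where neighbouring radii differ by more than t_r.
    Jumps are paired as (i, j); radii i+1 .. j are treated as the outlier run.
    """
    r = list(radii)
    n = len(r)
    k = 0
    while k < n - 1:
        if abs(r[k + 1] - r[k]) > t_r:            # first jump: record i
            i = k
            j = i + 1
            while j < n - 1 and abs(r[j + 1] - r[j]) <= t_r:
                j += 1                             # look for the second jump: record j
            if j >= n - 1:
                break                              # no closing jump; leave the rest untouched
            if j - i == 1:
                r[j] = r[i]                        # single outlier: pull the pixel back
            else:
                span = j + 1 - i
                for m in range(i + 1, j + 1):      # fill the run by linear interpolation
                    r[m] = r[i] + (r[j + 1] - r[i]) * (m - i) / span
            k = j + 1
        else:
            k += 1
    return r
```

A single spike is pulled back to its predecessor, while a longer outlier run is replaced by a straight ramp between the last good radius before it and the first good radius after it.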

Refinement 2: median smooth 

The median filter here is applied to the radii of the selected edge pixels. 
It smooths the spikes on the contour. This 
refinement method is similar to the one mentioned in section 2.2. 
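
The median smoothing of radii, used both in section 2.2 and in Refinement 2, can be sketched as a circular median filter. The window size of 3 and the function name are assumptions; the text does not state the window that was used.

```python
from statistics import median


def median_smooth_radii(radii, window=3):
    """Median-filter a closed contour given as radii at equally spaced angles.

    window: odd window size (assumption). The contour is closed, so the
    neighbourhood of the first and last radius wraps around.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    n = len(radii)
    half = window // 2
    return [median(radii[(i + d) % n] for d in range(-half, half + 1))
            for i in range(n)]
```

Because the filter works on radius lengths instead of gray levels, a single spike in the contour is replaced by the typical length of its neighbours.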



3 Active Contour Model 

The active contour model is a deformable model based on the energy 
minimization of the snake. The two types of energy, internal and external, 
create internal and external forces that pull the snake to the border of the ROI. In this 
work, the initial position is provided by RSM and the snake is a refinement method to 
improve the accuracy of delineation. 



3.1 Theory of Active Contour Model 

In Kass's work, the contour of the object is described by an energy function, which can 
be written as 

$$E_{total} = \int E_{snake}(v)\,dv = \int \left[E_{int}(v) + E_{ext}(v)\right] dv \qquad (1)$$

The snake energy consists of the internal energy $E_{int}$ and the external energy $E_{ext}$: 

$$E_{int}(v) = \frac{1}{2}\left[\alpha\,|x'(v)|^2 + \beta\,|x''(v)|^2\right] \qquad (2)$$

$$E_{ext}(v) = -\left|\nabla\left[G_\sigma * I\right]\right|^2 \qquad (3)$$

$\alpha$ and $\beta$ are weighting parameters that control the internal force to make the snake 
continuous and smooth. The external energy is based on image information 
such as border information, region information and so on. The resulting force attracts the snake 
towards the desired edge of the object. To minimize the energy $E_{total}$, the snake must 
satisfy the Euler-Lagrange equation 

$$\alpha\,x'' - \beta\,x'''' - \nabla E_{ext} = 0 \qquad (4)$$
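
To make the role of the two weights concrete, the internal energy can be evaluated on a sampled contour. The sketch below is an illustration only: it assumes a closed contour with unit parameter spacing and simple finite differences for the first and second derivative, and the function name `internal_energy` is hypothetical.

```python
def internal_energy(points, alpha, beta):
    """Discrete internal energy of a closed snake.

    points: list of (x, y) vertices; the contour is closed, so indices wrap.
    The first derivative is approximated by v[i+1] - v[i] and the second by
    v[i+1] - 2*v[i] + v[i-1] (unit parameter spacing assumed).
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        px, py = points[i - 1]
        cx, cy = points[i]
        nx, ny = points[(i + 1) % n]
        first_sq = (nx - cx) ** 2 + (ny - cy) ** 2
        second_sq = (nx - 2 * cx + px) ** 2 + (ny - 2 * cy + py) ** 2
        total += 0.5 * (alpha * first_sq + beta * second_sq)
    return total
```

The first term penalises stretching and the second penalises bending, so with the external energy switched off a contour can only lower its energy by shrinking and smoothing.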



3.2 Snake-Aided Radial Search 

The traditional snake has a limited capture range. This means the initial contour 
must be close to the true boundary; otherwise the snake will be lost. Cohen solved this 
problem by using an inflation force. Unfortunately, this increases the processing time 
when the initial snake is far away from the true border of the ROI. In this work, both 
problems are solved by RSM, which provides an initial contour close enough 
to the true border. Since the initial contour is close to the true boundary, the snake 
does not need to move far to reach the boundary. This reduces the number of snake iterations 
and shortens the whole processing time. The traditional snake cannot converge to concave 
shapes because of its lack of capture range [12]. The initial position of the snake given 
by radial search is close to the true border, so concave shapes can be outlined. 
The active contour model in this work thus becomes part of a coarse-to-fine scheme. It 
is used to improve the accuracy of contour delineation. The strategy of this work can 
be summarized as follows: 

♦ Use RSM to detect the contour of the ROI; 

♦ Use the contour detected by RSM as the initial position of the snake; 

♦ Use the snake to refine the detected contour. 

Finally, to reduce discontinuities, the neighboring pixels are checked again by 
comparing the lengths of the radii. This method is similar to the one mentioned in 
section 2.3.2, the discontinuity removing. 




Fig. 2. Contour detection and refinements. From left to right: RSM, RSM + level-2 refinement, 
snake-applied RSM, and refined-snake application. 




Fig. 3. Contour detection and refinements. From left to right: initial snake by RSM, snake in 
process, result after 50 iterations, and result of snake and manual outlining. 



4 Experiment Results 

The initial seed point needs to be within the ROI for the radial search method, so it is 
derived from the average position of the center of the bladder in 30 CT images. By 
investigation, the center of the bladder in the 30 CT images lies within the rectangle 
spanned by (245, 159) and (272, 204). The given point used in this work is (260, 178). It is not 
compulsory to use the center of the bladder as the given pixel, because the algorithm 
works as long as the given pixel lies within the ROI. 

The results of the contour delineation tool at different stages are presented in Figure 2. 

The figure shows that the accuracy of detection has been improved. RSM 
provides an initial position of the snake quite close to the true border, so after only a 
few iterations the snake is attracted to the contour of the ROI. Figure 3 shows the 
efficiency of the whole processing. 

To measure the performance of contour detection, the relative error is measured by 
comparison with manual outlining, which can be expressed by the following equation: 

$$E_{rel}(\%) = \frac{|R_{com}(\theta) - R_{man}(\theta)|}{R_{man}(\theta)} \times 100 \qquad (5)$$

Here $R_{com}$ is the computer-processed radius and $R_{man}$ is the radius of the manual outline. The 
improvement achieved by the snake can be expressed by the following equation: 

$$R_{imp}(\%) = 100 \times (R_{refine} - R_{snake}) / R_{refine} \qquad (6)$$

Here $R_{imp}$ stands for the ratio of improvement, $R_{refine}$ stands for the error after level-2 
refinement, and $R_{snake}$ stands for the error after the snake. The snake has been tested on a 
series of images. The improvement can be seen from the following table. 



Table 1. Error comparison between level-2 refinement and snake refinement 

Image     Level-2 refined (%)   Snake refined (%)   Improvement (%) 
V1        4.25                  3.42                19.5 
V2        3.1                   0.13                95.8 
V3        1.53                  1.06                30.7 
V4        6.73                  2.45                63.6 
V5        2.24                  0.61                72.8 
V6        7.68                  6.28                18.2 
V7        6.52                  3.45                47.1 
V8        1.74                  1.69                2.9 
V9        4.93                  1.25                74.6 
V10       1.47                  1.24                15.6 
Average   3.99                  2.16                45.9 



The largest improvement by the snake is 95.8% and the average improvement 
is 45.9%. These data show that the snake provides good contour delineation in medical 
image processing. The images were selected randomly. In general, the snake improves 
the accuracy of detection. However, the data above also show that the snake still has problems if the 
image suffers from noise and low contrast, as in image V6. The snake result still has a 
6.28% error there because the noise attracts the snake to wrong local minima; therefore 
other methods need to be developed to overcome this. 
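
The two measures used in this section can be reproduced with a few lines of Python. The sketch assumes that the relative error of a single ray is normalised by the manually outlined radius; the function names are illustrative, and the numbers are the V1, V2 and V6 rows of Table 1.

```python
def relative_error_percent(r_com, r_man):
    """Relative error of one ray: |R_com - R_man| / R_man * 100 (normalisation assumed)."""
    return abs(r_com - r_man) / r_man * 100.0


def improvement_percent(err_refine, err_snake):
    """Share of the level-2 refinement error that is removed by the snake."""
    return 100.0 * (err_refine - err_snake) / err_refine


# (level-2 error %, snake error %) for images V1, V2 and V6 from Table 1
table_rows = {"V1": (4.25, 3.42), "V2": (3.1, 0.13), "V6": (7.68, 6.28)}
improvements = {name: round(improvement_percent(a, b), 1)
                for name, (a, b) in table_rows.items()}
```

Rounded to one decimal place, the computed improvements for these three rows agree with the values listed in the table.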



5 Conclusions and Further Work 

This work presents successful edge detection in a medical imaging application. Due to the 
low contrast of some medical images, a single edge detection algorithm might fail to 
find the edge pixels. Therefore, to obtain a more accurate contour of the ROI, more than one 
algorithm has been implemented to provide more candidates for the edge pixels on the contour. 
To improve the accuracy, pre- and post-processing have been used. These methods 
improve the performance of the individual algorithms. The detected contour is a 
combination of the results of those algorithms. The final edge analysis is made by 
contour pixel selection and refinement methods. 

This method of edge detection is scalable. More edge detection algorithms 
can be added for more precise edge detection. More algorithms provide more options for 
edge pixels. This will improve the accuracy of voting-based edge pixel selection and avoid 
misdetection. The active contour model has proven effective for adjusting the contour 
detected by RSM. Even with the 2-level refinement methods applied after 
RSM, this method is still not very successful on some low contrast images. As 
shown in the previous section, the snake can be used for further refinement: it 
pulls the contour as close as possible to the true border of the ROI. 

Abstract. In this paper, we propose a new gaze detection system 
with dual cameras (a wide view and a narrow view camera). In order to 
locate the user's eye position accurately, the narrow view camera 
has auto focusing/panning/tilting functions based on the 
3D eye positions detected by the wide view camera. In addition, we 
use IR-LED illuminators for the wide and narrow view cameras, which 
ease the detection of facial features and of the pupil and iris positions. To 
overcome the problem of specular reflection on glasses caused by the illuminator, 
we use dual IR-LED illuminators for the wide and narrow view cameras. 
Experimental results show that the RMS error between the 
computed gaze positions and the real ones is about 2.89 cm. 

Keywords: Gaze Detection, Dual Cameras, Dual IR-LED Illuminators 



1 Introduction 

Gaze detection systems are important in many applications such as virtual reality 
and video conferencing. In addition, they can help the handicapped to use 
computers and are also useful for those whose hands are busy controlling other 
menus on the monitor [18]. Most previous studies focused on 2D/3D head 
rotation/translation estimation [2][14], facial gaze detection [3-9][15][16][18][21] 
and eye gaze detection [10-13][17][22-25]. Recently, gaze detection considering 
both head and eye movement has been researched. The methods of Ohmura and Ballard 
et al. [5][6] have the disadvantages that the user's Z distance must be 
measured manually and that computing the gaze position takes a long time 
(over 1 minute). The methods of Gee et al. [7] and Heinzmann et al. [8] only compute a gaze 
direction vector and do not obtain the gaze position on a monitor. In addition, 
if 3D rotation and translation of the head happen simultaneously, they cannot 
estimate the 3D motion accurately. The method of Rikert et al. [9] has the constraint 
that the user's Z distance must remain unchanged during the training and testing 
procedures, which is inconvenient for the user. The methods of 
[11-13][15][16] require a pair of glasses with marking points to detect facial 
features, which can also be inconvenient for the user. The methods of [3][4][19] 
consider only head movements, so the gaze errors increase when eye movements 
occur. To overcome such problems, the research of [20] considers both head and eye 
movements, but uses only one wide view camera, which captures the whole face of the user. 
In such a case, the eye image resolution is too low and the fine movements of the 
user's eye cannot be detected exactly. The method of Wang et al. [1] provides an advanced 
approach that combines head pose determination with eye gaze estimation using a wide 
view camera and a panning/tilting narrow view camera. However, their method assumes that 
the 3D distance between the two eyes and that between both lip corners are known and 
show no individual variation. In addition, it assumes that 
the 3D diameter of the eyeball is known and shows no individual variation. 
Based on these assumptions, they compute the gaze position on a monitor. 
However, our preliminary experiments show that there is considerable individual 
variation in these 3D distances and the 3D diameter, which can greatly increase the gaze 
errors. To overcome the above problems, we propose a new method and system 
for detecting the gaze position. 

Fig. 1. The gaze detecting system 

2 Localization of Facial Features in Wide View Image 

In order to detect the gaze position on a monitor, we first locate facial features (both 
eye centers, eye corners, nostrils) in wide view images. To detect facial features 
robustly in any environment, we detect the specular reflections 
on the eyes. For that, we implement the gaze detection system as shown in Fig. 1. 
The IR-LED (1) is used to produce the specular reflections 
on the eyes. The IR pass filter (2) in front of the camera lens passes only infrared 
light (over 800 nm), so the brightness of the input image is affected only by the 
IR-LED (1) and not by external illumination. We use an IR-LED (1) of 
880 nm because the human eye can only perceive visible and near infrared 
light (below about 880 nm), so our illuminators do not dazzle the user's 
eyes. When a user starts our gaze detection system, the micro-controller (4) 
turns the illuminator (1) on in synchronization with the even field of the 
CCD signal and turns it off in synchronization with the next odd field, 
successively [20]. From that, we obtain a difference image between the even and 
the odd image, in which the specular reflection points on both eyes can be easily 
detected because their gray levels are higher than those of other regions [20]. 

In addition, we use the red-eye effect and the method of changing the frame grabber 
decoder value in order to detect the eye position more accurately [20]. In general, the 
NTSC signal from the camera has a high resolution (0 ~ $2^{10} - 1$), but the A/D 
conversion range of the conventional decoder of the frame grabber has a low resolution (0 ~ 
$2^8 - 1$). So, the NTSC signal in the highly saturated range is represented as gray level 255 ($2^8 - 1$), 
and both the specular reflection on the eye (cornea) and some 
reflection regions on the facial skin can be represented by the same image level ($2^8 - 1$), 
which makes it difficult to discriminate the corneal specular reflection by 
image processing alone. However, the NTSC signal level of the corneal specular 
reflection is higher than that of other reflections due to its reflectance. So, 
if we lower the decoder brightness value, the A/D conversion range 
of the decoder is shifted upwards. In that case, there is no 
saturated range and the corneal specular reflection can easily be discriminated 
from the other reflections. 

Around the detected corneal specular reflection points, 
we determine an eye candidate region of 30*30 pixels and locate the accurate eye 
(iris) center by the circular edge detection method. Because the eye localization 
is performed in a restricted region, it can be done in real time (below 3 ms on a 
Pentium-III 866 MHz). After locating the eye center, we detect the eye corners by 
using an eye corner shape template and an SVM (Support Vector Machine) [20]. We use 
2000 successive image frames for SVM training and an additional 1000 images 
for testing. Experimental results show that the classification error is 0.11% for the training 
data and 0.2% for the testing data. The classification time of the SVM is 
as small as 8 ms on a Pentium-III 866 MHz. After locating the eye centers and eye 
corners, the positions of the nostrils are detected using anthropometric constraints 
of the face and an SVM. In order to reduce the effect of changes in facial expression, 
we do not use the lip corners for gaze detection. Experimental results show 
that the RMS errors between the detected feature positions and the actual positions 
(manually detected positions) are 1 pixel (for both eye centers), 2 pixels (for both 
eye corners) and 4 pixels (for both nostrils) in a 640x480 pixel image. From these, 
we use 5 feature points (left/right eye corners of the left eye, left/right eye corners 
of the right eye, nostril center) in order to detect the facial gaze position. 
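
The difference-image idea from this section, subtracting the field captured with the illuminator off from the field captured with it on, can be sketched with NumPy. The array names, the function name and the threshold value are assumptions for illustration; the actual system works on interlaced CCD fields and additionally adjusts the decoder brightness.

```python
import numpy as np


def specular_candidates(even_field, odd_field, threshold):
    """Return (row, col) positions that are much brighter with the IR-LED on.

    even_field: 2-D uint8 array captured with the illuminator on.
    odd_field:  2-D uint8 array captured with the illuminator off.
    threshold:  minimum gray-level difference (hypothetical tuning value).
    """
    # Widen the type before subtracting so that uint8 values cannot wrap around.
    diff = even_field.astype(np.int16) - odd_field.astype(np.int16)
    rows, cols = np.nonzero(diff > threshold)
    return list(zip(rows.tolist(), cols.tolist()))
```

Pixels lit by external illumination appear in both fields and cancel out, so only regions brightened by the IR-LED, such as the corneal reflections, survive the threshold.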

3 Four Steps for Computing Facial Gaze Position 

After feature detection, we take 4 steps in order to compute a gaze position on 
a monitor [3][4][20]. In the 1st step, when a user gazes at 5 known positions on 
a monitor ((1), (6), (12), (18), (23) of Fig. 3), the initial 3D positions (X, Y, Z) of the 
5 feature points (detected in section 2) are computed automatically [3][4]. 
In the 2nd and 3rd steps, when the user rotates/translates his head in 
order to gaze at one position on a monitor, the new (changed) 3D positions of 
those 5 features are computed by 3D motion estimation. Considering the many 
limitations of previous motion estimation research, we use the EKF (Extended 
Kalman Filter) [2] for 3D motion estimation, and the new 3D positions of those 
features are computed by the EKF and an affine transform [3][20]. In the 4th 
step, one facial plane is determined from the new (changed) 3D positions of 
the 5 features, and the normal vector of the plane (whose origin lies in the middle of the 
forehead) gives the gaze vector due to head (facial) movements. The 
gaze position on a monitor is the intersection between the monitor and 
the gaze vector [3][4][20]. 
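
The 4th step, intersecting the facial-plane normal with the monitor, can be sketched as follows. This is a simplified illustration under stated assumptions: the plane is spanned by only three feature points instead of five, all coordinates are given in a monitor frame whose screen is the plane Z = 0, and the function names are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-D vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def sub(a, b):
    """Component-wise difference a - b of two 3-D vectors."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def facial_gaze_position(p1, p2, p3, origin):
    """Intersect the facial-plane normal through `origin` with the plane Z = 0.

    p1, p2, p3: three non-collinear 3-D feature points spanning the facial plane.
    origin:     3-D start point of the gaze vector (e.g. the middle of the forehead).
    Returns (X, Y) on the monitor, or None if the normal is parallel to the monitor.
    """
    normal = cross(sub(p2, p1), sub(p3, p1))
    if abs(normal[2]) < 1e-12:
        return None
    t = -origin[2] / normal[2]       # solve origin.z + t * normal.z = 0
    return (origin[0] + t * normal[0], origin[1] + t * normal[1])
```

A face parallel to the monitor yields a gaze position directly in front of the origin point, and tilting the facial plane moves the intersection across the screen.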
4 Auto Panning/Tilting/Focusing of Narrow View 
Camera 

Based on the new (changed) 3D positions of the 5 feature points (which are 
computed in the 2nd and 3rd steps as mentioned in section 3), we can pan and 
tilt the narrow view camera in order to capture the eye image. For that, we 
also perform the coordinate conversion between the monitor and the narrow view 
camera using the internal/external camera parameters, which are obtained at the initial 
calibration stage. This calibration method is the same as that between the wide 
view camera and the monitor. Details can be found in [3]. When the 
user rotates his head severely, one of his eyes may disappear from the camera view. 
So, we track only one visible eye with the auto panning/tilting narrow view camera. 
A conventional narrow view camera has a small DOF (Depth of Field), and the DOF 
can only be increased to a limited extent with a fixed focus camera. So, we use 
an auto focusing narrow view camera in order to capture a clear eye image. 
Auto focusing requires the Z distance between the eye and the camera, which 
we obtain in the 2nd and 3rd steps (as mentioned in section 
3). In order to compensate for the focusing error due to inaccurate Z distance 
measurement, we use an additional focus quality checking algorithm for the input eye 
image. If the focus quality does not meet our threshold (70 in the range (0 ~ 
100)), we perform an additional focusing process by sending a focus lens movement 
command to the camera micro-controller. 

At this stage, we have to consider 
the specular reflection on glasses. The surface of glasses can produce a specular 
reflection that covers the whole eye image. In that case, the eye region 
is not detected and we cannot compute the eye gaze position. So, we use dual 
IR-LED illuminators as shown in Fig. 1 (6). When a large specular reflection is caused 
by one illuminator (right or left), it can be detected in the 
image. As mentioned in section 2, the NTSC analog level of a specular reflection 
region is higher than that of any other region, so such regions can be detected by changing 
the decoder brightness setting. When a large specular region is found with 
the changed decoder brightness value, our gaze detection system switches 
the illuminator (from left to right or from right to left), so that the specular reflection on 
the glasses no longer occurs. 

5 Localization of Eye Features in Narrow View Image 

After we get the focused eye image, we perform the localization of eye features
as shown in Fig. 2. We detect P1 ~ P4 in the right eye image as shown in Fig. 2 and
also detect P5 ~ P8 in the left eye image for eye gaze detection. Here,
P1 and P5 denote the pupil center and P2 and P6 the iris center. J. Wang
et al. [1] use a method that detects the iris outer boundary by a vertical edge
operator, a morphological “open” operation and elliptical fitting. However, the
upper and lower regions of the iris outer boundary tend to be covered by the eyelid,
and inaccurate elliptical fitting of the iris occurs due to the lack of iris
boundary pixels. In addition, their method computes the eye gaze position by
checking the shape change of the iris when a user gazes at monitor positions.
However, our experimental results
Fig. 2. The features for eye gaze detection from right eye 



show that the change in iris shape is very small, and it is difficult to
detect the eye gaze position accurately from that information alone. So, we use
the positional information of both the pupil and the iris. We also use the
shape change of the pupil, which tends not to be covered by the eyelid. In general,
an IR-LED of short wavelength (700nm ~ 800nm) produces high contrast
between the iris and the sclera. On the other hand, one of long wavelength (800nm ~
900nm) produces high contrast between the pupil and the iris. Based on that, we use
a multi-wavelength IR-LED illuminator (760nm and 880nm) as shown in
Fig. 1(6). As shown in Fig. 2(b), the shapes of the iris and pupil are nearly
elliptical when the user gazes at a side position of the monitor. So, circular edge
detection cannot be used. Instead, we use the Canny edge operator to extract
edge components and a 2D edge-based elliptical Hough transform. From that, we
obtain the center positions and the major/minor axes of the iris/pupil ellipses. In
order to detect the eye corner positions, we detect the eyelids as shown in Fig. 2,
because the upper and lower eyelids meet at the two eye corners. To
extract the eyelid region, we use a region-based eyelid template deformation
and masking method. In detail, we make the eyelid edge image with the Canny edge
operator and apply the deformable template as the eyelid mask. Here, we use
2 deformable templates (parabolic shape) for upper and lower eyelid detection,
respectively. From that, we can detect the eye corners accurately as shown in
Fig. 2. Experimental results show that the RMS errors between the detected eye
feature positions and the actual ones (manually marked) are 2 pixels (iris
center), 1 pixel (pupil center), 4 pixels (left eye corner) and 4 pixels (right
eye corner). Based on the detected eye features, we select the 22 feature values
(f1 ~ f11 are used when the right eye image is captured by the narrow view
camera as shown in Fig. 2, and f12 ~ f22 are used when the left eye image is
captured). With those feature values, we compute the eye gaze position on the
monitor. Details are given in Section 6.
Fig. 3. An example of gaze detection errors on a 19” monitor 

6 Detecting the Gaze Position on a Monitor 

In Section 3, we explained the gaze detection method considering only head
movement. As mentioned before, when a user gazes at a monitor position, both the
head and the eyes tend to move simultaneously. So, we additionally compute the
eye gaze position from the 22 detected feature values (as described in Section
5) using a neural network (multi-layered perceptron). Here, the input values of the
neural network are normalized by the distance between the iris/pupil center and
the eye corner, which is obtained while the user gazes at the monitor center. This
is necessary because we do not use a zoom lens in our camera: the closer the user
is to the monitor (camera), the larger the eye appears and, consequently, the
larger the distance between the pupil/iris and the eye corner becomes.
After detecting the eye gaze position with the neural network, we determine
the final gaze position on the monitor due to head and eye movements by
the vector summation of the two gaze positions (face and eye gaze) [20].

7 Performance Evaluations 

The gaze detection error of the proposed method is compared to that of our
previous methods [3] [4] [18] [20] in Table 1. The test data were acquired
while 95 users gazed at 23 gaze positions on a 19” monitor, as shown in Fig. 3.
Here, the gaze error is the RMS error between the actual gaze positions and the
computed ones. As shown in Table 1, the gaze errors are calculated for two cases.
Case I gives the gaze error for test data including only head movements,
and case II gives the gaze error for test data including both head and eye
movements. As shown in Table 1, the gaze error of the proposed method is the
smallest in both cases.

Fig. 3 shows an example of the gaze detection errors on a 19” monitor. 
The reference positions are marked with a “black circle” and the computed gaze 
Table 1. Gaze error on the test data (cm)

Method                       case I    case II
Linear interpol. [18]        5.1       11.8
Single neural net [18]       4.23      11.32
Combined neural nets [18]    4.48      8.87
[3] method                   5.35      7.45
[4] method                   5.21      6.29
[20] method                  3.40      4.8
Proposed method              2.24      2.89

positions are shown as “X”. From Fig. 3, we can see that the gaze errors
are larger in the lower region of the monitor. That is because our gaze
detecting cameras are positioned on top of the monitor, so fine movements of the
head and eyes cannot be observed when the user gazes at the lower positions of the
monitor. In the 2nd experiment, points of radius 5 pixels were
spaced vertically and horizontally at 1.5” intervals on a 19” monitor with a
screen resolution of 1280x1024 pixels, as in Rikert’s research [9]. The RMS
error between the real and calculated gaze positions is 2.85 cm, which is much
better than Rikert’s method (almost 5.08 cm). Our gaze error corresponds
to an angular error of 2.29 degrees on the X axis and 2.31 degrees on the Y axis. In
addition, we tested the gaze errors at different Z distances (55, 60, 65cm).
The RMS errors are 2.81cm at 55cm, 2.85cm at 60cm and 2.92cm at 65cm, which shows
that the performance of our method is hardly affected by the user’s Z position. The
last experiment, on processing time, shows that our gaze detection process takes
about 500ms on a Pentium-III 866MHz, which is much less than Rikert’s method (1
minute on an AlphaStation 333MHz). The research in [1] reports a smaller angular
error of below 1 degree, but their method assumes that the 3D distances
between the two eyes and between both lip corners are known and that there is no
individual variation in these 3D distances or in the 3D diameter of the eyeball.
However, our preliminary experiments show that there is much individual variation,
and such cases can greatly increase the gaze error (to an angular error of more
than 5 degrees).

8 Conclusions 

This paper describes a new gaze detection method. In future work, we plan
to research a method of capturing a higher resolution eye image with a zoom lens,
which should increase the accuracy of the final gaze detection.

Abstract. In this paper we present a new method for self-localization
on wafers using geometric hashing. The proposed technique is robust to
image changes induced by process variations, as opposed to the traditional,
correlation-based methods. Moreover, it eliminates the need for
training on reference patterns. Two enhancements to the
basic geometric hashing scheme are introduced, improving its performance and
reliability: using a quadtree for efficient data access and optimal rehashing for
Bayesian voting. The approach proved to be highly reliable when tested
on real wafer images.



1 Introduction 

As computational power has increased over the past decade, machine vision
systems have become far more capable than before. In the semiconductor industry,
where the highest levels of precision and robustness are required, they have
evolved to become a mainstream automation tool enabling computers to replace human
vision and guide robotic handling, assembly, and inspection processes. Various
kinds of semiconductor manufacturing equipment require precise self-localization,
so that operations such as lithography, cutting and inspection can be performed to
extremely tight tolerances. That is why self-localization on wafers has emerged as
a very important task.

Machine vision tools are increasingly required to adapt to
in-process variations and to locate reference patterns despite changes
in visual appearance occurring during the manufacturing process. Such changes
may include non-linear contrast variation, color inversion, re-scaling, rotations
and partial pattern obliteration [7].

Traditional tools, found in most commercial packages today, adopt normalized
grayscale correlation (NGC), which is adequate for locating patterns under
ideal conditions but cannot cope with changes in pattern appearance at run-time.
Correlation scores are sensitive to degraded images and exhibit low tolerance
to image changes in scale, angle, obliteration and contrast. Some vendors
have recently proposed different techniques for self-localization to counteract
such negative effects. For example, PatMax software from Cognex applies geometric
feature analysis to find patterns on the wafer. Individual key features
are first found, so that attributes such as shape, angles, arcs and shading can
be used to achieve invariant matching. Stemmer Imaging utilizes different tools,
such as Support Vector Machines (SVM), neural networks, and an optimized Hough
transform, besides NGC, to accomplish invariant pattern recognition. All these
approaches are somewhat limited, as they build upon training on a particular,
predefined feature (known as an “acquisition target”) printed at a certain
position on the wafer. During online self-localization this feature must appear in
the field of view of the tool (although it might be transformed). Straightforwardly
comparing the current query image with all feasible features is unrealistic.

In this paper we propose a method for self-localization on wafers. It estab- 
lishes a correspondence between the pattern currently observed in the field of 
view of the imaging tool and the previously constructed wafer map. It is fast 
enough for inline microscopy, robust to process variations and does not require 
training on the acquisition targets. The method is based on geometric hashing 
[2, 4, 5, 6], a well known pattern recognition algorithm. Tests performed on real 
wafer images demonstrate the high reliability of the suggested approach. 

2 Self-Localization as Pattern Recognition Task 

Pattern recognition is a process of identifying objects from perceptual data. 
Recognition is achieved by finding the correspondence between a given pattern 
and a set of predefined patterns. In the model-based PR approach, the predefined 
patterns are described in terms of various properties, such as shape, color, etc. 
These descriptions are referred to as “models”. A query pattern is then matched 
to one of those models. 

Localization on the wafer is defined in the following manner: given an “eye
point” (e.g. a partial image of the wafer), estimate its exact position on the wafer
map. Map-based self-localization can therefore be interpreted as model-based
pattern recognition as follows. First, the wafer map is constructed from partial
images captured by a microscope imaging system moving over the wafer surface.
A possible alternative is to use the wafer layout file specifying its geometric
structure. The wafer map can be divided into many adjacent parts to be identified
during localization. These parts correspond to models in the pattern recognition
framework, whereas the eye-point plays the role of a query pattern. Matching the
current eye-point to one of the previously prepared parts of the wafer map during
localization is essentially the same as associating a query pattern with one of
the predefined models in pattern recognition. An example of a wafer eye-point
and the corresponding part of the wafer map is shown in Figure 1.

To cope with the enormous number of geometric structures contained in wafer 
images, we choose to address the problem of self-localization using geometric 
hashing. Matching between query eye-point and wafer map is achieved by spatial 
correspondence of geometric features extracted from the images. These features 
are used to compose invariant model representations, stored in a database during 
the offline preprocessing stage of the algorithm. When analyzing the eye-point 
during localization, the same invariant representation is used as an indexing key 
to access the hash table and vote for the possible model matches. The model 
Fig. 1. Example of the eye-point within a wafer map 




Fig. 2. Outline of the localization process. A wafer map that is constructed from 40 
model images is shown on the top. Voting results for the eye-point shown in the middle 
are plotted on the left. The enlarged image of the winning model (25) with its feature 
points marked with black dots is presented on the right. 



accumulating a significant number of votes indicates the correspondence of the
current eye-point with that model. An example of a typical localization process is
presented in Fig. 2. This scheme provides low online complexity, which determines
the actual localization time: it depends linearly on the number of features
contained in the eye-point and is independent of the number of models stored in
the system. This allows fast localization even on very large scale
maps.

The localization algorithm is completed by verification. Given a set of candidate
models that accumulated the highest numbers of votes, one has to determine
which is the best match to the query eye-point. The eye-point is characterized in
terms of a set of feature points $\{x'_i\}$ in $\mathbb{P}^2$, and each of the
candidate matching models is likewise described by its feature points $\{x_i\}$.
First, it is essential to find all point correspondences
$x_i \leftrightarrow x'_i$ in order to compute a similarity transformation $H$ which
transforms a model to the eye-point: $H x_i = x'_i$ for each $i$. Two correspondences
are enough to compute $H$; however, since the points in the query eye-point are
measured inexactly (due to noise), all of the correspondences should be used to
determine the “best” transformation given the data. Every true correspondence
gives rise to two independent equations in the entries of $H$, while the outliers are
(a) Eye-point feature points (b) Voronoi tessellation (c) Voronoi diagram 

Fig. 3. The process of constructing the Voronoi tessellation of the eye-point for verifi- 
cation acceleration. 



robustly eliminated by the RANSAC algorithm. Then H is calculated by finding 
the least-squares solution of the over-determined linear system. 
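
The least-squares step can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the correspondences have already been cleaned of outliers (the RANSAC stage is omitted), and it represents a 2D similarity without reflection as a complex linear map w = c z + t, which is one convenient way to write the four-parameter linear problem. The function names are hypothetical.

```python
def fit_similarity(model_pts, eye_pts):
    """Least-squares similarity (rotation, uniform scale, translation)
    mapping model points onto eye-point features.
    Points are (x, y) pairs; returns complex (c, t) with w = c*z + t."""
    if len(model_pts) != len(eye_pts) or len(model_pts) < 2:
        raise ValueError("need at least two correspondences")
    z = [complex(x, y) for x, y in model_pts]
    w = [complex(x, y) for x, y in eye_pts]
    n = len(z)
    z_mean = sum(z) / n
    w_mean = sum(w) / n
    denom = sum(abs(zi - z_mean) ** 2 for zi in z)
    if denom == 0:
        raise ValueError("model points coincide")
    # Closed-form solution of the (complex) linear regression problem.
    c = sum((wi - w_mean) * (zi - z_mean).conjugate()
            for zi, wi in zip(z, w)) / denom
    t = w_mean - c * z_mean
    return c, t

def apply_similarity(c, t, pt):
    """Transform a single (x, y) point with the fitted similarity."""
    p = c * complex(*pt) + t
    return (p.real, p.imag)

# Example: rotation by 90 degrees, scale 2, translation (3, -1).
model = [(0, 0), (1, 0), (0, 1), (1, 1)]
eye = [(3, -1), (3, 1), (1, -1), (1, 1)]
c, t = fit_similarity(model, eye)
```

Here abs(c) is the scale factor and the argument of c is the rotation angle, so all four similarity parameters are recovered from c and t.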

An important issue is how to efficiently find all of the correspondences. The
voting stage of the algorithm provides one corresponding basis (two point-to-point
correspondences) between the candidate model and the eye-point. This
allows us to approximate the desired transformation $H$ by $H_B$ and then, after
applying $H_B$ to the candidate model, every model point $H_B x_i$ will correspond to
the closest eye-point feature $x'$.

Thus, to compute all of the point correspondences it is possible to check the
distance of each point $x'$ to every transformed model point $H x_i$. If the model
contains $m$ points and the eye-point contains $n$ points, those inter-set distances
are computed in $O(mn)$ time. This computation can be accelerated by employing
a Voronoi tessellation [3] for segmentation of the eye-point image. A Voronoi
tessellation is a partitioning of a plane with $n$ points into $n$ convex polygons
such that each polygon contains exactly one point and every point in a given polygon
is closer to its central point than to any other. We start the verification by
constructing the Voronoi tessellation from the points in the query eye-point, which
is done in $O(n \log n)$ time [3] (see Fig. 3). This allows us to find the point
corresponding to $x_i$ in $O(\log n)$ time by checking which polygon of the Voronoi
tessellation contains the transformed point $H x_i$ and choosing its center point.
It follows that the time needed to calculate the point correspondences is reduced
from $O(mn)$ to $O(m \log n)$.
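
The nearest-feature lookup can be sketched in a few lines. Note that this sketch substitutes a k-d tree for the explicit Voronoi tessellation described above: both return the eye-point feature closest to a transformed model point (that is, the site of the Voronoi cell containing it), and for well-distributed points the k-d tree query also takes logarithmic time on average. The simple build shown here re-sorts at every level, and the distance threshold `max_dist` is an added assumption, not part of the paper.

```python
def build_kdtree(points, depth=0):
    """Build a 2D k-d tree (nested dicts) from a list of (x, y) points."""
    if not points:
        return None
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"point": pts[mid], "axis": axis,
            "left": build_kdtree(pts[:mid], depth + 1),
            "right": build_kdtree(pts[mid + 1:], depth + 1)}

def nearest(node, q, best=None):
    """Return (point, squared_distance) of the tree point closest to q."""
    if node is None:
        return best
    p = node["point"]
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    if best is None or d2 < best[1]:
        best = (p, d2)
    diff = q[node["axis"]] - p[node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, q, best)
    # Visit the other side only if the splitting line is close enough.
    if diff * diff < best[1]:
        best = nearest(far, q, best)
    return best

def match_points(transformed_model_pts, eye_pts, max_dist):
    """Pair each transformed model point with its closest eye-point
    feature, dropping pairs farther apart than max_dist."""
    tree = build_kdtree(list(eye_pts))
    if tree is None:
        return []
    pairs = []
    for q in transformed_model_pts:
        p, d2 = nearest(tree, q)
        if d2 <= max_dist ** 2:
            pairs.append((q, p))
    return pairs
```

The resulting pairs are the candidate correspondences that would then be passed on to outlier rejection and the least-squares estimate of the transformation.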



3 Algorithm Performance Enhancements 

In this section we suggest two enhancements to the basic method improving its 
performance and reliability. They address the most problematic issues of the pro- 
posed localization method (as well as the general geometric hashing technique), 
which are the performance degradation in the presence of noise and non-uniform 
occupancy of hash bins. 
Fig. 4. Decomposition of the wafer map with quadtree (panels: wafer map, quadtree). 



3.1 Quadtree 

Unlike absolute localization on the wafer, in the incremental localization
discussed here the initial position is assumed to be known approximately at
the beginning of the localization session. The goal is then to refine the
eye-point position estimate. Thus, it is possible to search for the eye-point only
in a relatively small “expectation region” on the wafer map, based on the known
initial location. One can observe that in the case of localization on the wafer
the set of models is actually formed from neighboring wafer image tiles.
This allows us to refine the basic algorithm using the quadtree, a technique for
encoding an image as a tree structure (Figure 4).

The root node represents the entire image; its children represent the four 
quadrants of the entire image; their children represent the sixteen sub-quadrants, 
and so on. 

The basic algorithm uses a 2D hash table whose bins are accessed according
to the computed invariant coordinates. Multiple entries within a single bin are
organized in a linked list and retrieved all together when the corresponding bin is
accessed. The proposed enhancement replaces the linked list with a quadtree at
each bin in the hash table. These trees correspond to the space partitioning of the
global wafer map. This way it is possible to access only the relevant part of each
tree during voting and thus to count votes exclusively for models from the
“expectation region”. To put it differently, the quadtree allows any partial wafer
area to be selected as the search area for the query eye-point. In practice, the
quadtree approach reduces the number of irrelevant entries accessed in the hash
table, without actually removing any of the stored data.
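
A possible shape of such a per-bin quadtree is sketched below. It is an illustration under stated assumptions rather than the authors' data structure: each entry is taken to be a model identifier tagged with the wafer-map position of its tile, and the class name, `capacity` and `max_depth` parameters are hypothetical.

```python
class QuadTree:
    """Point quadtree over wafer-map coordinates; one tree per hash bin."""

    def __init__(self, x0, y0, x1, y1, capacity=4, depth=0, max_depth=12):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.depth = depth
        self.max_depth = max_depth
        self.entries = []          # (x, y, model_id) stored in a leaf
        self.children = None       # four sub-quadrants after a split

    def insert(self, x, y, model_id):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False
        if self.children is None:
            full = len(self.entries) >= self.capacity
            if not full or self.depth >= self.max_depth:
                self.entries.append((x, y, model_id))
                return True
            self._split()
        for child in self.children:
            if child.insert(x, y, model_id):
                return True
        return False

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        args = (self.capacity, self.depth + 1, self.max_depth)
        self.children = [QuadTree(x0, y0, xm, ym, *args),
                         QuadTree(xm, y0, x1, ym, *args),
                         QuadTree(x0, ym, xm, y1, *args),
                         QuadTree(xm, ym, x1, y1, *args)]
        old, self.entries = self.entries, []
        for entry in old:
            for child in self.children:
                if child.insert(*entry):
                    break

    def query(self, qx0, qy0, qx1, qy1):
        """Entries inside the expectation region [qx0, qx1) x [qy0, qy1)."""
        x0, y0, x1, y1 = self.bounds
        if qx1 <= x0 or qx0 >= x1 or qy1 <= y0 or qy0 >= y1:
            return []              # region does not touch this quadrant
        found = [e for e in self.entries
                 if qx0 <= e[0] < qx1 and qy0 <= e[1] < qy1]
        for child in self.children or []:
            found.extend(child.query(qx0, qy0, qx1, qy1))
        return found
```

During voting, each accessed hash bin would call `query` with the expectation region and cast votes only for the returned model identifiers; quadrants outside the region are skipped without being traversed.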

3.2 Rehashing for Bayesian Voting 

Ideally, for every feature point of the query eye-point image there is a single
model point in the corresponding hash table bin. In practice, features generated
by other models can fall into the same bin or even coincide. To deal with this
problem one might reduce the bin size. Unfortunately, the feature points
are non-uniformly distributed over the hash table. Therefore, for any bin size
there will be either overpopulated or empty bins. It is generally proposed to use
rehashing to deal with the problem of non-uniform occupancy of hash bins [8].

Another problem is that the uncertainty in the feature point position caused by 
image noise shifts it away from the corresponding model feature. This can be 
Fig. 5. Examples of the real wafer images used as models in the algorithm. 



solved by looking for the matching model feature in a certain error-dependent
neighborhood of the eye-point feature, the voting region (see for example the
Bayesian approach in [9]).

To combine and take the best from both approaches we propose to use a
scheme that equalizes voting regions rather than feature density, as suggested
before. This is achieved by re-mapping the hash entries using the mapping $T$ :
$(u, v) \to (u', v')$ such that



v' = arctan(^) 

where $r = \sqrt{u^2 + v^2}$. The detailed derivation and the theoretical basis of
the scheme are reported elsewhere [1]. This improves the computational performance
of localization (as well as of general geometric hashing) by minimizing the hash
table size and the number of bins accessed, while maintaining an optimal recognition
rate. Alternatively, the proposed scheme can be used in classical single-bin voting
to improve the recognition rate.



4 Experimental Results 

In this section we demonstrate the capabilities of the proposed localization
algorithm and provide a systematic evaluation of its performance and effectiveness.
We performed tests on real wafer images obtained on a KLA-Tencor 5200XP overlay
metrology tool using a 750 micron field of view. These images were used to
construct a map covering an area of 2.25x12.75 millimeters on the wafer surface.
Examples with enlarged partial images after preprocessing and corner detection
are shown in Figure 5. There are two sequential stages in the localization
algorithm: indexing-based voting and candidate verification. During voting,
an invariant description of the eye-point is calculated and used to index into the
hash table and vote for all the accessed entries. This description is based on a
pair of features, a basis (see [6]). In many practical situations, there is a good
chance that one of the points used to form a basis was reported by mistake and does
not match any model point. Therefore, one should make multiple attempts using
different bases (i.e. different descriptions), to ensure with sufficiently high
probability that at least one of them is free of outliers. We evaluate the
performance of the localization algorithm by varying the number of different
eye-point feature bases used in voting.
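
The effect of trying several bases can be made concrete with a standard probability argument. The sketch below assumes that each eye-point feature is a true (inlier) feature with probability w, independently of the others, and that the bases are drawn independently; both are simplifying assumptions introduced here, and the value of w in the example is arbitrary rather than measured.

```python
import math

def prob_clean_basis(w, k):
    """Probability that at least one of k two-point bases is outlier-free,
    given inlier probability w per feature point."""
    return 1.0 - (1.0 - w * w) ** k

def bases_needed(w, p):
    """Smallest number of bases giving success probability >= p."""
    if not 0.0 < w <= 1.0 or not 0.0 < p < 1.0:
        raise ValueError("require 0 < w <= 1 and 0 < p < 1")
    if w == 1.0:
        return 1                     # every basis is clean
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w * w))

# Example with an arbitrary inlier probability of 0.7.
k = bases_needed(0.7, 0.99)
```

Because a basis consists of two points, a single basis is clean with probability w squared, which is why a handful of attempts is usually enough even with a noticeable share of spurious features.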
Fig. 6. System behavior with different number of bases being used in voting. 



We tested the algorithm on a total of $10^4$ different localization tasks to obtain
a statistically meaningful measure of its performance. Each time we select a
random eye-point and then, if the correct location on the wafer “map” is reported
by the algorithm (ground truth was available due to the nature of the data set
formation), the result is considered to be a true positive (TP). In the case of an
incorrect location or no location (simply because none of the database models got
enough votes), the result is regarded as a false positive (FP) or a miss,
respectively.

The summary of the obtained results is presented in Figure 6. The hit rate
(HR) reaches 95% with a 4% false alarm rate and a 1% miss rate when 4
bases are used. The inaccuracy of the localization result may be formulated
as follows. Assuming the eye-point features are measured with a Gaussian
error of standard deviation $\sigma$, it can be shown that the RMS distance of the
estimated point location from its true value is $\sigma (d/2n)^{1/2}$, where $n$ is
the number of correspondences used and $d$ is the number of transformation
parameters. Thus, substituting $d = 4$ for a similarity, and taking 50 sample
points, results in an estimation error of 0.2 pixels (for $\sigma$ = 1 pixel). If an
eye-point image of size 200x200 pixels is taken at a resolution of 50 micron, we
arrive at a localization accuracy of 50 nanometers.
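
The arithmetic behind these figures can be checked with a few lines. The sketch assumes a measurement error of one pixel standard deviation, and it reads the "resolution of 50 micron" as a 50 micron wide field of view imaged onto 200 pixels; both readings are interpretations made here, chosen because they reproduce the stated 0.2 pixels and 50 nanometers.

```python
import math

def rms_location_error(sigma, d, n):
    """RMS error sigma * sqrt(d / (2 n)) of the estimated location for
    d transformation parameters and n correspondences."""
    return sigma * math.sqrt(d / (2.0 * n))

sigma_px = 1.0                 # assumed feature measurement error (pixels)
error_px = rms_location_error(sigma_px, d=4, n=50)

field_of_view_um = 50.0        # assumed width of the eye-point image
image_width_px = 200
pixel_size_um = field_of_view_um / image_width_px   # microns per pixel
error_nm = error_px * pixel_size_um * 1000.0        # microns -> nanometers
```

The error shrinks with the square root of the number of correspondences, so quadrupling the number of matched points would halve the localization error.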

HR of 100% is not achieved as the constructed map contains areas difficult 
for localization: having no distinguishable features or filled with repetitive ge- 
ometric structures. Note that even a human would have serious difficulties in 
solving the task of self-localization for “degenerate” eye-points selected from 
these unfavorable areas. Generally, we found that the algorithm performed well 
for eye-points from most of the wafer areas. 

5 Summary 

We presented a new method for self-localization on wafers, based on the geometric
hashing technique. The method is invariant to changes in visual appearance,
such as non-linear contrast variation, scale, rotation and partial obliteration.
Two enhancements to the basic geometric hashing algorithm were proposed, improving
its computational performance by optimally distributing the entries over
the hash table and allowing efficient access to the table entries. We showed
how verification can be significantly accelerated by applying a Voronoi
tessellation of the eye-point. Extensive experimental analysis demonstrates the high
reliability of the proposed method.

Abstract. Quality assurance programs of today’s car manufacturers
show an increasing demand for automated visual inspection tasks. A typical
example is just-in-time checking of assemblies along production lines.
Since high throughput must be achieved, object recognition and pose
estimation rely heavily on offline preprocessing stages of available CAD
data. In this paper, we propose a complete, universal framework for CAD
model feature extraction and entropy index based viewpoint selection
that is developed in cooperation with a major German car manufacturer.



1 Introduction 

Quality assurance and final inspection are fundamental steps in the production
workflow. Automated visual inspection of assemblies is therefore in the focus of
recent research (cf. [8], [6], [9] and [5]). Because CAD data of the assembled parts
must be available for construction processes, model-based object recognition and
pose estimation are suitable methods to allow automated visual inspection. Real-time 
production processes dictate the need for fast and accurate online algorithms. 
The framework we propose hence transfers as much of the algorithmic effort 
as possible to an offline preprocessing stage, yielding very fast and accurate 
online visual inspection. Our framework is based on a new generalized definition 
of features that supports the incorporation of different feature types under a 
common layer of abstraction. 

Besides the efficient online application, the selection of appropriate cam- 
era viewpoints is fundamental to robust visual inspection of assemblies. Our 
framework therefore also predicts viewpoints which optimally separate different 
expected assembly configurations of valid and invalid mounting scenarios. 

The article is structured as follows: In Section 2, we propose a generalized
definition of features for model-based object recognition and pose estimation. It
will be shown how the framework models rigid objects and flexible collections
of objects. In Section 3, we discuss how to accurately predict occlusions
by applying a mixture of rule-based lookups and bounding volume intersection
tests. Section 4 then addresses the calculation of optimal camera viewpoints using
3D to 2D projection pursuit with a collective entropy index. Finally, Section 5
details the framework’s performance in feature extraction, occlusion prediction
and object recognition.



2 Characteristic Localized Features 

The framework proposed in this paper is a preprocessing stage suited for model- 
driven 3D/2D object recognition and pose estimation algorithms like the ones 
introduced by Lowe [7] and Araujo et al. [1]. In general, they use an initial 
object pose estimate to project features of a given 3D model on the camera 
view plane. Afterwards, they iteratively obtain improved estimates by matching 
the projected features with features extracted from real world images. Object 
recognition algorithms generally require features that are highly characteristic. 
For pose estimation, features have to be localized (must have a spatial position) in 
the model and image domain. Thus, our framework must automatically extract 
Characteristic Localized Features (CLFs). In order to be suitable for any 3D/2D 
object recognition scheme, each CLF must at least meet the following set of 
requirements: 

1. Projection: CLFs are spatially represented in 3D. To allow for 2D comparison,
CLFs must be projected on a camera view plane, given a camera
model and an estimated pose. An appropriate projection rule has to be
defined for every type of CLF.

2. Visibility determination: Since CLFs can become occluded under 2D 
projections, their visibility has to be determinable for any given view. CLFs 
that are visible are called active. 

3. Visual Appearance: Projected CLFs are compared to image features.
Therefore, 2D projection must imply some visual outcome recognizable in
real world images. E.g., in the case of edges, the visual appearance would
typically be a strong local image gradient perpendicular to the edge direction.

These requirements form a unique layer of abstraction that enables the pro- 
posed framework to perform all tasks without incorporating any further knowl- 
edge about feature types. 

Good CLFs are reliably trackable features in image sequences, as presented by 
Shi and Tomasi [11] or Schmid et al. [10]. Since they have been empirically shown 
to be appropriate, edges are commonly used (cf. [6]). We chose contour edges, i.e. 
edges that potentially form the object’s outline, to explain our approach in the 
following. Additionally, the framework incorporates functionality to deal with 
localized color and texture features. 

Edges which possibly form the contour of an object are interesting CLF can- 
didates because the object’s silhouette is always formed by a subset of contour 
edges. The silhouette will usually appear in real world images as intensity gradi- 
ents. What is more, Kettner and Welzl [4] provided empirical evidence that the 
number of contour edges in a 3D model is usually much smaller than the total
number of edges. The framework determines the set $E_c$ of a model’s potential
contour edges by analyzing the angle between all its adjacent triangles:

\[ E_c = \{\, E \mid \mathrm{isconvex}(E) \wedge \alpha_E = \angle(N_1^E, N_2^E) > 0 \,\} \tag{1} \]

where $N_1^E$, $N_2^E$ represent the normals of two adjacent triangles and $E$ the
edge shared by the triangles. The angle $\alpha_E$ allows a score to be assigned to
each element of $E_c$, because a more acute angle yields a more frequent
appearance of the edge under different projections.

Fig. 1. Left: Automatically extracted CLFs (axis units in mm). Model edges are
displayed as thin dashed lines, extracted CLFs as thick black ones. Right:
Visibility map of the CLF highlighted on the left side. Black denotes view angles
under which the CLF is active. Axis units denote the view angle measured in degrees.

All elements of $E_c$ with a certain minimum score become new contour edge CLFs.
To meet requirement 2, the visibility of the edge elements is pre-calculated
relative to all possible discrete view angles and stored in separate
run-length-encoded visibility maps. An example of automatically generated
contour edge CLFs and a particular visibility map is displayed in Fig. 1.
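
The contour edge criterion of Eq. (1) can be illustrated with a short Python
sketch. It is not the framework's implementation: the helper names, the
assumption of counter-clockwise vertex order (so that normals point outward)
and the threshold `min_angle_deg` are hypothetical choices made for
illustration only.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit_normal(tri):
    """Unit normal of a triangle given as three vertices (CCW order assumed)."""
    n = cross(sub(tri[1], tri[0]), sub(tri[2], tri[0]))
    length = math.sqrt(dot(n, n))
    return tuple(x / length for x in n)

def contour_edge_score(tri1, tri2, opposite2, min_angle_deg=0.0):
    """Return alpha_E in degrees if the shared edge E qualifies, else None.

    tri1, tri2: adjacent triangles sharing the edge E.
    opposite2:  the vertex of tri2 that does not lie on E.
    """
    n1, n2 = unit_normal(tri1), unit_normal(tri2)
    # Convex edge: the far vertex of tri2 lies behind the plane of tri1.
    is_convex = dot(n1, sub(opposite2, tri1[0])) < 0.0
    cos_a = max(-1.0, min(1.0, dot(n1, n2)))
    alpha = math.degrees(math.acos(cos_a))
    return alpha if is_convex and alpha > min_angle_deg else None
```

For two faces of a cube meeting at an edge, the function returns an angle of
about 90 degrees; for coplanar or concave neighbours it returns None.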

Based on the specification of CLFs, a (basic) model can be defined as a 
set of CLFs referring to the same rigid object and object coordinate system. 
Furthermore, an aggregation can be described as a tree in which the root node 
represents the aggregation’s pose with respect to the world coordinate system. 
Each sub-node represents a basic model and the model’s pose (6DOF) relative 
to the parent node. 

3 Occlusion Prediction 

Inferring aggregation poses from real world images by means of 3D/2D object 
recognition schemes always involves the projection of the aggregation features 
on 2D camera planes. Regarding our framework, the projection of CLFs belong- 
ing to an aggregation might result in inactive (occluded) CLFs. Fig. 2 shows 
that any CLF might either become occluded by parts of the basic model it is at- 
tached to or by other basic models of the aggregation. The former occlusion type 
will be termed intramodel occlusion, the latter intermodel occlusion. Automated
inspection in the car industry requires fast online occlusion prediction.
Intramodel occlusions are correctly predicted by lookup operations in the
visibility maps. In the worst case, these maps consume space in the order of
$O(c \cdot v)$, with $c$ denoting the number of CLFs and $v$ referring to the
number of scanned view angles during map calculation. The lookup operation has
efficient constant time complexity per call.

Fig. 2. The two occlusion types occurring with aggregations. Left: Intramodel
occlusion. A contour edge CLF along the bolt’s thread (dashed black line) is
hidden behind the same bolt’s head. Right: Intermodel occlusion. The same contour
edge CLF, partly occluded (dashed black line) by a knob.

Extending the lookup strategy to aggregations would require pre-calculating
the visibility maps for all CLFs attached to every possible aggregation
configuration. This would lead to a combinatorial explosion of storage space
consumption. Therefore, intermodel occlusion prediction is based on tightly
wrapping each aggregated model in a small number of simple geometric bounding
volumes such as boxes or spheres. Our framework performs this task offline
during aggregation creation. The online part of occlusion prediction first
checks the pre-calculated visibility maps. For each visible candidate, view-rays
between a virtual camera and points on the candidate CLF are tested for
intersection with each bounding volume, thus ruling out features that are
(partially) hidden behind parts of the aggregation. The intersection tests have
a reasonable worst case time complexity of $O(c_v \cdot b)$, with $b$ denoting
the number of bounding volumes and $c_v$ the number of CLFs passing the
visibility map test.
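
The view-ray test described above can be sketched in Python for the special
case of spherical bounding volumes. This is a minimal illustration under
stated assumptions, not the framework's code: the function names are
hypothetical, the spheres are assumed to wrap the other basic models of the
aggregation, and the sample points on the CLF are assumed to be given.

```python
def segment_hits_sphere(cam, point, center, radius):
    """True if the view-ray segment cam -> point passes through the sphere."""
    d = [p - c for p, c in zip(point, cam)]      # direction of the view-ray
    f = [s - c for s, c in zip(center, cam)]     # camera -> sphere centre
    dd = sum(x * x for x in d)
    if dd == 0.0:
        t = 0.0
    else:
        # Parameter of the segment point closest to the centre, clamped to [0, 1].
        t = max(0.0, min(1.0, sum(a * b for a, b in zip(d, f)) / dd))
    closest = [c + t * x for c, x in zip(cam, d)]
    dist2 = sum((a - b) ** 2 for a, b in zip(closest, center))
    return dist2 <= radius * radius

def clf_is_occluded(cam, sample_points, spheres):
    """True if any sampled point of a CLF is hidden behind any bounding sphere.

    spheres: iterable of (center, radius) pairs. One test per pair of sample
    point and bounding volume, in line with the O(c_v * b) bound.
    """
    return any(segment_hits_sphere(cam, p, c, r)
               for p in sample_points for (c, r) in spheres)
```

Clamping the segment parameter ensures that volumes lying behind the feature,
as seen from the camera, do not count as occluders.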

4 Viewpoint Selection 

In order to support robust recognition, a further task of the framework is to 
determine those viewpoints from which an assembly might be inspected best. 
In this context, Vazquez et al. [13] proposed the information theoretic measure 
viewpoint entropy. It expresses the amount of information conveyed in a certain 
scene that is being watched from a given point. Measures like viewpoint entropy 
are often based on the visual appearance of a specific feature. Though we use an 
entropy measure, too, the CLF abstraction enables us to estimate the underlying 
probability distributions from the location of a variety of features. The entropy 
measure employed here was recently introduced as a class separability index [12]
and is called collective entropy. It estimates the quality of a view by measuring
how distinguishable aggregation configurations will be under projection onto a
given camera plane. An example with two configurations is shown in Fig. 3.

Fig. 3. Top: A knob, screw and nut aggregation in configurations typical for
invalid (left) and valid (right) mounting. Bottom: The good quality view (left)
allows good distinction between different nut positions. In the bad quality view
(right), the nut position is hard to infer as large parts of it are hidden behind
the knob.

Generally, collective entropy describes how well measurements in Cartesian 
space, each belonging to a distinct class, might be separable from each other with 
respect to the class labels. It is calculated by partitioning the $N$ measurements
$m_i$ into $d$-dimensional cells with hyper-cuboid topology:

\[ m_i = (m_{i1}, \ldots, m_{id}) \in \mathbb{R}^d, \quad i = 1, \ldots, N \tag{2} \]

\[ R_j = \left[ \min_i m_{ij},\; \max_i m_{ij} \right], \quad 1 \le j \le d, \; i = 1, \ldots, N. \tag{3} \]

The faces of the hyper-cuboid cells are constructed by dividing each range
of values $R_j$ into $B$ parts of equal length. An initial cell resolution is
chosen and the $m_i$ are partitioned accordingly. Afterwards, one obtains the
conditional entropy which Cover and Thomas [2] define as

\[ H(X|Z) = -\sum_{z \in Z} p(z) \sum_{x \in X} p(x|z) \log_2 p(x|z), \tag{4} \]

where each $z \in Z$ is a non-empty hyper-cuboid cell and $X$ is the set of
measurement class labels. Thus, $H(X|Z)$ indicates how uniformly distributed the
measurements are, given a certain partitioning resolution. However, $H(X|Z)$ is
not robust against the shifting of cell borders. Singh [12] therefore repeatedly
lowers the cell resolution and recalculates the conditional entropy until a
minimum resolution is reached. Collective entropy is then taken as the area under
the curve of the conditional entropy values with respect to cell resolution.

Fig. 4. Complete map of collective entropy indices. Dark areas denote high quality
view angles, light areas indicate bad quality (all axis units in degrees). The
arrows point to the map positions corresponding to the bottom two views in Fig. 3.

Viewpoint selection iteratively places a virtual camera at discrete view angles 
in an orbit around an aggregation. For each iteration and for each expected 
configuration, the positions of visible CLFs are projected to the camera plane. 
Afterwards, the probability distributions in (4) are obtained by Monte Carlo 
sampling from the CLF location domain. The complete scheme can be regarded 
as 3D to 2D projection pursuit with collective entropy as projection pursuit index 
(cf. [3]). To our knowledge, it has not been tried before. The process yields a 
map that indexes the degree to which any discrete view angle conveys separable 
information about the observed scene. Some results are shown in Fig. 3. 
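
The collective entropy index used above can be sketched in Python as follows.
The sketch follows Eqs. (2) to (4), but several details are assumptions made
for illustration and may differ from Singh's definition [12]: the resolution
is lowered in steps of one from `max_bins` to `min_bins`, and the area under
the curve is approximated with the trapezoidal rule.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(points, labels, bins):
    """H(X|Z) in bits, with each value range R_j split into `bins` equal parts."""
    d = len(points[0])
    lo = [min(p[j] for p in points) for j in range(d)]
    hi = [max(p[j] for p in points) for j in range(d)]
    cells = defaultdict(Counter)          # cell index -> class label counts
    for p, lab in zip(points, labels):
        idx = []
        for j in range(d):
            width = (hi[j] - lo[j]) / bins
            # Points on the upper border fall into the last cell.
            k = 0 if width == 0 else min(int((p[j] - lo[j]) / width), bins - 1)
            idx.append(k)
        cells[tuple(idx)][lab] += 1
    n = len(points)
    h = 0.0
    for counts in cells.values():         # only non-empty cells are stored
        total = sum(counts.values())
        for c in counts.values():
            p_xz = c / total
            h -= (total / n) * p_xz * math.log2(p_xz)
    return h

def collective_entropy(points, labels, max_bins=8, min_bins=1):
    """Area under the H(X|Z) curve over decreasing cell resolution (assumed schedule)."""
    values = [conditional_entropy(points, labels, b)
              for b in range(max_bins, min_bins - 1, -1)]
    # Trapezoidal rule with unit spacing between successive resolutions.
    return sum((a + b) / 2.0 for a, b in zip(values, values[1:]))
```

A low index means that the class labels remain separable over many cell
resolutions, which corresponds to a high quality view in the sense used here.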



5 Performance 

During object recognition, the step inducing the highest computational load is
the 3D to 2D projection of features because it involves online occlusion
prediction. Therefore, we evaluated the performance of our online algorithm in
the following way: First, we chose an evaluation candidate out of a set of
aggregations with varying complexity. Single basic models with a total number of
fewer than 1000 CLFs were considered to be of low complexity. In contrast,
aggregations of more than two basic models with a total number of more than 2000
CLFs were considered to be of high complexity. Each candidate was randomly
rotated in 3D and online occlusion prediction carried out in 1000 runs. We then
calculated the average execution times, which are visualized in Fig. 5. It shows
that even for the most complex aggregation the algorithm executes in less than
12 ms. On average, the execution time scales approximately linearly with the
total number of CLFs. To ensure that the results of our automated feature
selection are suited for model-based object recognition, we first determined the
average number of active CLFs using an evaluation scheme similar to the one
above. The results are listed in Table 1. The average number of active CLFs is
well balanced for the first three objects in Table 1 and rather low for the
“oil lid”, “knob” and the assembly of “knob”, “bolt” and “flat washer”,
indicating that their CLF sets should be compressed.

Fig. 5. Performance of occlusion prediction on a Pentium 4 PC (2 GHz, 512 MByte).

Table 1. Average number of active CLFs compared to their total number.

object         total no. CLFs      avg. act. CLFs
nut              92  (100%)         13.6  (14.8%)
bolt            148  (100%)         28.0  (18.9%)
flat washer     164  (100%)         37.2  (22.7%)
oil lid         418  (100%)         32.7   (7.8%)
knob           1747  (100%)         74.0   (4.2%)
assembly       2059  (100%)        112.4   (5.5%)

Table 2. Average and standard deviation of relative and absolute pose estimation
accuracy.

DOF          μ relative   σ relative   μ absolute   σ absolute
x [mm]         -0.15        0.53         -0.31        0.53
y [mm]         -0.002       0.4          -0.81        0.73
z [mm]         -0.007       4.76         -2.6         3.9
roll [°]        0.006       1.27          0.578       1.4
pitch [°]       0.097       0.9          -0.513       1.57
yaw [°]         0.068       0.84         -0.45        1.8

Object recognition and pose estimation performance was evaluated with an
industrial system (cf. [6]). A standard camera with 320x240 resolution was moved
around a mounted oil lid at constant speed and a distance of approx. 70 mm,
recording 420 images. The object pose was calculated for each image. The average
and standard deviation of relative (i.e. image-to-image) and absolute accuracy
are given in Table 2. Note that the average error of parameter estimation relative
to the distance of the camera is always smaller than 1%. Thus, model-based pose
estimation meets the strong accuracy requirements of the car industry.



6 Conclusion 



We presented a complete, universal framework for automated selection of
features and viewpoints for model-based visual inspection that was developed in
cooperation with DaimlerChrysler AG. Given CAD data of real world objects,
the framework extracts characteristic features and prepares them for fast and
robust occlusion prediction. It further determines high quality viewpoints from
which to inspect an assembly. Our feature extraction approach has been
demonstrated for contour edge features. Performance results for the offline
model preparation were given accordingly. For online occlusion prediction,
execution time in the average case scaled approximately linearly with the number
of processed features. The tests have been carried out on CAD models of car
production assemblies and standard industrial fixation elements.

The underlying concepts for occlusion prediction and viewpoint selection are
not restricted to contour edges, but can also be used for a wide selection of other
kinds of 3D localized features which meet the CLF requirements. The proposed
framework is thus based on a novel layer of abstraction for features in general.
It was successfully tested with an industrial object recognition system.

Abstract. Dyed barley cells in microscope colour images of biological
experiments are analysed for the occurrence of haustoria of the powdery
mildew fungus by a fully automated screening system. The region of
interest in the images is found by applying Canny’s edge detector to
the hue channel of the HSV colour space. Potential haustoria regions
are extracted in RGB colour space by an adaptive Gaussian mixture
classifier based on the Expectation Maximisation (EM) algorithm. Since
the classes cell and haustorium lie very close together, their correct
separation is a crucial part and needs a constraining mechanism which
ties the EM algorithm to its initialisation data to prevent too large a
deviation from it.



1 Introduction 

Automating the screening and the analysis of biological experiments is a
challenging research area in the field of bioinformatics and engineering. This
paper is related to a project where resistance mechanisms of crop plants against
the powdery mildew fungus are studied from the genetic point of view. In the
experiments, young barley leaves are bombarded with DNA-coated tungsten particles
to “switch on or off” desired genes in cells. For analysis purposes, an additional
reporter gene¹ is expressed in cells that were hit by a particle. This dyes the
affected genetically transformed cells greenish blue and allows their
identification by bright field microscopy [8]. The task is to evaluate the
susceptibility of the genetically transformed cells to the powdery mildew fungus
under the impact of different test genes. A successful penetration of the fungus
into the cell is indicated by the development of a haustorium, a dark object with
“fingers” that is located between the cell wall and the cell membrane and feeds
the fungus by leaching the cell. These objects have to be counted in an automatic
analysis procedure.

Since there are many genes to be considered for a potential resistance of the
plant against pathogens, a large number of experiments has to be performed to
attain sufficient statistical confidence. Therefore, an automated image
acquisition system and an automatic analysis procedure are needed. Manual
screening is a tedious, subjective and time-consuming task that cannot be handled
by laboratory assistants given the huge amount of data. For automatic image
acquisition, the microscope slides are mounted on an x-y table which scans
a number of preparations fully automatically under the control of a computer,
e.g., overnight. Finding genetically transformed cells and assessing the
development status of the haustoria therein without human interaction is the task
and the challenge of the analysis procedure.

1 β-glucuronidase (GUS) reporter gene

Fig. 1. Cutouts of microscope images of barley cells (both panels labelled
“Haustoria of the powdery mildew fungus within a transformed barley cell”). The
dyed cells are genetically transformed, both cells contain two haustoria of the
powdery mildew fungus. At coarse scales, Canny’s edge detector marks these cells
by a closed boundary.
Color version available via http://bic-gh.ipk-gatersleben.de/wgrp/mue/prj03.php

This paper describes a method to automatically identify suspicious objects,
i.e., parts of genetically transformed cells that may be a haustorium. It is
organised as follows: Section 2 introduces the properties of the image material
and explains how the regions of interest, i.e., genetically transformed cells, are
found in the images. Afterwards, Section 3 describes the identification of
potential haustoria via the Expectation Maximisation (EM) algorithm, before
Section 4 concludes the paper.

2 Preprocessing of the Image Material

Figure 1 shows two typical cutouts of microscope images, both containing one
dyed genetically transformed cell with two haustoria of the powdery mildew
fungus inside. By default, the microscope camera produces images of 2600 x 2060
pixels in 24-bit colour.

In [5] we have shown that the dyed genetically transformed cells can be
reliably detected by applying Canny’s edge detector [2] to the hue channel of
the HSV colour space, rather than performing multi-dimensional edge detection
in the RGB colour space or using histogram-based methods. At a coarse scale,
Canny’s algorithm marks the dyed cells by a closed boundary. The bounding box
of these closed contours will be the input of the further haustorium detection
procedure. Unfortunately, the haustoria scarcely stand out from the dyed cell,
and there is no such straightforward colour space transformation to separate
them as well as the dyed cells from the remaining cell tissue. Therefore, we
stay in the RGB colour space, which contains the entire image information, and
show what haustorium detection results can be achieved by pixel classification
methods.



3 Cell Image Analysis by Clustering in Colour Space 



3.1 Naive Bayes Classification 

Consider a naive Bayes classifier first. A number of $N$ $d$-dimensional data
vectors $\mathbf{x}_n \in \mathbb{R}^{d \times 1}$ from the entire data set
$X \in \mathbb{R}^{d \times N}$ has to be classified into $K$ classes. If the
prior (a priori) probabilities $P(k)$ and the probability density functions
$p(\mathbf{x}|k)$ of the $k = 1 \ldots K$ classes are known, then the posterior
(a posteriori) probability $P(k|\mathbf{x}_n)$ of a sample vector $\mathbf{x}_n$
to belong to class $k$ can be calculated by Bayes’ rule [1] (maximum likelihood
decision) according to

\[ P(k|\mathbf{x}_n) = \frac{P(k)\, p(\mathbf{x}_n|k)}{\sum_{j=1}^{K} P(j)\, p(\mathbf{x}_n|j)} \tag{1} \]



Inspecting our data in the RGB colour space, we can decompose the mixture
distribution of colours into three stretched ellipsoids, representing the three
dominant image components, namely background, cell, and haustorium. Such an
ellipsoidal distribution can be well modelled by the multivariate Gaussian
distribution, which is described by the mean vector $\boldsymbol{\mu}$,
specifying the center point of the ellipsoid, and the covariance matrix
$\boldsymbol{\Sigma}$, which is responsible for the shape and the orientation of
the ellipsoid.

\[ p(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{\sqrt{\det \boldsymbol{\Sigma}_k\,(2\pi)^d}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x}-\boldsymbol{\mu}_k)\right) \tag{2} \]
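
Bayes’ rule with Gaussian class densities, as given in Eqs. (1) and (2), can
be sketched with NumPy as follows. This is an illustrative sketch rather than
the authors’ implementation; the function names are hypothetical, and the
samples are stored row-wise (shape N x d) instead of column-wise.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density, Eq. (2), for each row of x (shape N x d)."""
    d = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    # Quadratic form (x - mu)^T Sigma^-1 (x - mu) for every sample.
    expo = -0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)
    return np.exp(expo) / np.sqrt(np.linalg.det(cov) * (2.0 * np.pi) ** d)

def posteriors(x, priors, means, covs):
    """Posterior probabilities P(k|x_n), Eq. (1), as an N x K array."""
    joint = np.stack([P * gaussian_pdf(x, m, c)
                      for P, m, c in zip(priors, means, covs)], axis=1)
    return joint / joint.sum(axis=1, keepdims=True)
```

A hard label per pixel is then obtained with `posteriors(...).argmax(axis=1)`,
while the rows themselves provide the soft output.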

See Figure 2 for the segmentation results of this naive Bayes classification
where the parameters of the classes were taken from typical samples. In the upper
images, the pixel labels are depicted in a soft-output manner, i.e., the vector of
the posterior probabilities
$[P(k=3|\mathbf{x}), P(k=2|\mathbf{x}), P(k=1|\mathbf{x})]^T$ is assigned
to the RGB value of each pixel, making the saturation of the colour follow the
reliability of the estimate. The lower figures show both the clusters in RGB
colour space as well as the principal components (eigenvectors) of each cluster.
As can be seen, simply assigning parameters from typical images for the three
classes and performing a naive Bayes classification does not provide satisfactory
results because the parameter set will never match the actual scenario sufficiently
due to some inevitable variations in colour and illumination in the image data.
Therefore, some “self adaptation” of the classification algorithm to the actual
data is needed to improve the classification results.

Fig. 2. Segmentation by a naive Bayes pixel-classification in RGB colour space
modelling the classes by multivariate Gaussians.

3.2 EM Classification Using the Complete Data Set 

The Expectation Maximisation (EM) algorithm [4,7] is known to be a powerful
clustering technique for mixture distributions where the parameters of the
underlying probability density functions are adapted in an iterative way, trying
to yield the best recovery of the mixture components. Its clustering performance
depends on two major conditions: the precision with which the data model
represents the actual data, and the initialisation parameters, because the
algorithm can converge to local extrema instead of finding the global optimum.
Such clustering methods are used for many different applications in image
processing, e.g., skin detection [6]. In [3] an advanced image querying system is
described which applies the EM algorithm to an eight-dimensional space of colour,
texture, and position features, where the number of mixture components is chosen
following the Minimum Description Length (MDL) principle. Fortunately, we know
the number of mixture components in feature space very well due to the specific
nature of our image material. Furthermore, colour appears as the dominant feature;
therefore we can ignore texture and position features and use the RGB colour
information as the only feature.

Fig. 3. Segmentation results of the EM algorithm when iterating on the entire data
set of RGB colour vectors.

We initialise the a priori probability of the classes with $P(k) = 1/K = 1/3$
(since we do not know $P(k)$ in advance) and perform a data-driven initialisation
of the mean vectors and covariance matrices of the classes from exemplary,
hand-segmented image parts, as already done for the naive Bayes classification.
Then, the iteration of the EM algorithm is run in the following manner:

The probability (at iteration step $t$) of each data vector $\mathbf{x}_n$ to
belong to class $k$ is calculated (expectation step) by

\[ P^t(k|\mathbf{x}_n) = \frac{P^t(k)\, p(\mathbf{x}_n|\boldsymbol{\mu}_k^t, \boldsymbol{\Sigma}_k^t)}{\sum_{j=1}^{K} P^t(j)\, p(\mathbf{x}_n|\boldsymbol{\mu}_j^t, \boldsymbol{\Sigma}_j^t)} \tag{3} \]

A new parameter set for the iteration step $t+1$ containing the prior
probabilities, mean vectors and covariance matrices for each class is calculated
according to (maximisation step)

\[ P^{t+1}(k) = \frac{1}{N} \sum_{n=1}^{N} P^t(k|\mathbf{x}_n) \tag{4} \]

\[ \boldsymbol{\mu}_k^{t+1} = \frac{1}{N P^{t+1}(k)} \sum_{n=1}^{N} P^t(k|\mathbf{x}_n)\, \mathbf{x}_n \tag{5} \]

\[ \boldsymbol{\Sigma}_k^{t+1} = \frac{1}{N P^{t+1}(k)} \sum_{n=1}^{N} P^t(k|\mathbf{x}_n)\, (\mathbf{x}_n - \boldsymbol{\mu}_k^{t+1})(\mathbf{x}_n - \boldsymbol{\mu}_k^{t+1})^T \tag{6} \]

The algorithm is terminated when the labelling in the segmented image does
not change anymore.

Fig. 4. Segmentation results of the EM algorithm iterating on data vectors that
were estimated with a reliability of at least $R_{min} = 0.65$.

As can be seen in Figure 3, the clustering separates the background and cell
class very well but it suffers from an overestimation of the haustorium class.
This solution is optimal from the EM point of view, but it is not our desired
result for an appropriate segmentation. Even with different initial parameters,
the EM algorithm tends towards similarly poor results. Incrementing the model
order, i.e., providing more classes, generally does not yield more solid
results, especially for the right hand image.
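
The unconstrained EM iteration on the complete data set, i.e. the expectation
step followed by the maximisation step and the label-based stopping rule, can
be sketched with NumPy as follows. This is a minimal sketch and not the
authors’ code: function names and the `max_iter` safeguard are assumptions,
and samples are stored row-wise (shape N x d).

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density for each row of x (shape N x d)."""
    d = x.shape[1]
    diff = x - mean
    expo = -0.5 * np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return np.exp(expo) / np.sqrt(np.linalg.det(cov) * (2.0 * np.pi) ** d)

def em_step(x, priors, means, covs):
    """One EM iteration; returns updated parameters and the posteriors used."""
    n, k = x.shape[0], len(priors)
    joint = np.stack([priors[j] * gaussian_pdf(x, means[j], covs[j])
                      for j in range(k)], axis=1)
    post = joint / joint.sum(axis=1, keepdims=True)       # expectation step
    new_priors = post.sum(axis=0) / n                     # updated priors
    new_means = (post.T @ x) / (n * new_priors)[:, None]  # updated mean vectors
    new_covs = []
    for j in range(k):                                    # updated covariances
        diff = x - new_means[j]
        new_covs.append((post[:, j, None] * diff).T @ diff / (n * new_priors[j]))
    return new_priors, new_means, np.array(new_covs), post

def run_em(x, priors, means, covs, max_iter=100):
    """Iterate until the hard labelling no longer changes (or max_iter is hit)."""
    labels = None
    for _ in range(max_iter):
        priors, means, covs, post = em_step(x, priors, means, covs)
        new_labels = post.argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return priors, means, covs, labels
```

In the setting of this paper, `x` would hold the RGB vectors of all pixels in
the bounding box and the initial parameters would come from hand-segmented
samples; the sketch makes no attempt to reproduce the reported results.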



3.3 Constraining the EM by Reliability Information 

A straightforward solution to achieve appropriate segmentation results is found
in constraining the algorithm to the initial parameter set, which is known quite
well in our particular case. Using the complete data set (all image pixels)
causes a large number of cell labels to flip to haustorium labels during the
iterations. Iterating on reliably estimated data vectors only (instead of on the
entire data set) prevents the algorithm from deviating too much from its initial
parameters. The classification reliability of each sample is given by
$R = \max_k \{P^t(k|\mathbf{x}_n)\} \in [1/K, 1]$ and is inherently calculated
in each iteration.

Table 1. Tapering the subset of data samples which the EM uses for iteration by a
stepwise variation of the reliability parameter $R_{min}$.




In the following, we restrict the data set, which the EM operates on, to data
samples that were classified with a reliability of at least $R_{min}$. Note that
the lower bound of this parameter depends on the number of classes and that an
appropriate parameter value has to be found empirically by a visual inspection
of the segmentation results. See Table 1 for a test series of our particular
segmentation problem. It can be observed that there is a significant changeover
between $R_{min} = 0.50 \ldots 0.60$. Choosing $R_{min}$ larger than 0.75, we
observed convergence problems of the algorithm for the right hand image, where
the algorithm oscillated harmonically between two states instead of terminating.
This can be explained by the recurrent changing of the considered data set parts
during the iterations and needs further attention.
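
The reliability-based restriction of the data set can be sketched in a few
lines of Python. The function name is hypothetical, and how the normalisation
by N in the maximisation step is handled for the reduced set is not stated in
the text; the sketch only selects the samples and leaves that choice open.

```python
def reliable_subset(posteriors, r_min):
    """Indices n of samples whose reliability R = max_k P(k|x_n) is >= r_min.

    posteriors: list of per-sample posterior vectors (each of length K).
    """
    k = len(posteriors[0])
    if not (1.0 / k <= r_min <= 1.0):
        # R can never fall below 1/K, so smaller thresholds select everything.
        raise ValueError("r_min must lie in [1/K, 1]")
    return [n for n, p in enumerate(posteriors) if max(p) >= r_min]
```

In each iteration, only the samples returned by this selection would enter the
parameter update, while all pixels are still labelled in the expectation step.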

Figure 4 shows the detailed segmentation results for $R_{min} = 0.65$. Despite
some misclassified objects in the haustorium class, it shows the haustoria quite
well. With this method we are able to automatically identify suspicious
objects, i.e., potential haustoria. Now, further analysis of the detected objects
is needed to distinguish haustoria from discolourations or other parts inside the
cell that have a similar colour, e.g., the cell nucleus. As a next step, these
image parts therefore have to be further evaluated, taking shape parameters of the
detected objects into account, e.g., by detecting the “fingers” of the haustoria.
This will be examined in the near future and is beyond the scope of this paper.

This paper is accompanied by a supplementary web site presenting the results.
Visit http://bic-gh.ipk-gatersleben.de/wgrp/mue/prj03.php for a more
detailed compilation of exemplary cell images and their clustering results.





Automating Microscope Colour Image Analysis 543 



4 Conclusions 

The Expectation Maximisation (EM) algorithm is applied in the RGB colour
space to perform a segmentation of microscope colour images for the
identification of small objects which scarcely stand out from the region of
interest. It is shown that, to provide satisfactory results, this special problem
needs a constraint mechanism which ties the EM algorithm to its initialisation
parameters and prevents too large a deviation from them. This constraint mechanism
is realised by dynamically restricting the data set the algorithm operates on to a
reliably estimated part only. The mechanism is parametrised by a reliability
threshold parameter which has to be determined empirically. This technique
prevents the segmentation from drifting away from the desired result and provides
good retrieval results of suspicious objects via an automatic analysis procedure.

Abstract. In January 2004 the High Resolution Stereo Camera (HRSC) on
board the ESA mission Mars Express started imaging the surface of planet
Mars in colour and stereoscopically in high resolution. The Institute of
Photogrammetry and Geoinformation (IPI) of the University of Hannover and the
Chair for Photogrammetry and Remote Sensing (LPF) of the Technische
Universität München are jointly processing the data of the HRSC: Using
automatically extracted tie points and Mars Orbiter Laser Altimeter (MOLA) data,
the exterior orientation of the Mars Express spacecraft is being calculated
continuously in a combined photogrammetric bundle adjustment during the two-year
mission. This paper describes the approaches used for tie point matching
and bundle adjustment. On the basis of two selected orbits the results of the
matching and the achieved accuracy of the bundle adjustment are presented and
evaluated.



1 Introduction 

In June 2003 the European Space Agency (ESA) launched the Mars Express
spacecraft from the Baikonur launch pad in Kazakhstan. After a journey of about
six months the orbiter was successfully inserted into a polar orbit around Mars.
During its two-year mission the High Resolution Stereo Camera (HRSC) on board
Mars Express images large parts of the Mars surface. The HRSC is a multisensor
pushbroom camera consisting of nine charge coupled device (CCD) line sensors
mounted in parallel for simultaneous high resolution stereo, multispectral, and
multi-phase imaging [1]. At pericenter, about 300 km above the surface of Mars, a
ground resolution of approximately 12 m is attained. The Camera Unit (CU) of the
HRSC additionally comprises a Super Resolution Channel (SRC) which captures frame
images embedded in the basic HRSC swath at a ground resolution of up to 2.5 m.

The three-dimensional position and attitude of the spacecraft are constantly
determined by the European Space Agency (ESA) by combining Doppler shift
measurements, ranging data, triangulation measurements and a star tracker
camera. These measurements result in a three-dimensional position and attitude of
the spacecraft over time which can be considered as approximate exterior
orientation (EO) in classical photogrammetry. However, these values are not
consistent enough for high accuracy photogrammetric point determination.
Therefore, a bundle adjustment has to be performed using these values as direct
observations for the unknown EO parameters. As further input for the bundle
adjustment, automatically extracted tie points derived via digital image matching
(DIM) are used. Additionally, ground control points (GCPs) are necessary to
transform the results into a Mars-fixed coordinate system. Because very few
classical GCPs exist on Mars, a globally available digital terrain model (DTM) is
applied.

In section two of this paper the approach for the determination of the EO of Mars 
Express is presented. In section three the results of the tie point matching and the 
bundle adjustment derived from two selected test orbits are shown and discussed. 



2 Photogrammetric Point Determination 

The processing of the HRSC data is divided into two steps. First, tie points are
extracted using software developed at IPI in Hannover. The derived tie points
serve together with the observed EO and the DTM as input for the bundle adjustment
developed at LPF in Munich. With the resulting adjusted EO of the Mars Express
Orbiter it is possible to derive high level products such as DTMs, ortho photos
and shaded reliefs from the imagery.

The principle of the transformation from object $(X, Y, Z)$ to image coordinates
$(x, y)$ is explained in [2]. The starting point is the set of collinearity
equations [4]:

The EO refers to a camera coordinate system common to all CCD lines and is
expressed for a given readout cycle $n$ as $X_0, Y_0, Z_0, \varphi, \omega, \kappa$.
The interior orientation (IO) parameters $x_0, y_0, c$ are defined in the image
coordinate system; three separate values exist for each line. The transformation
between the image coordinate system and the camera coordinate system is given by
$\Delta X_0, \Delta Y_0, \Delta Z_0, \Delta\varphi, \Delta\omega, \Delta\kappa$,
which have been determined in the geometric calibration for each line separately.
$M$ as well as $D$ are rotation matrices, $\lambda$ is a scale factor. The image
coordinates are given by $x$ and $y$, which are derived automatically in this case
via DIM.

The IO of the HRSC has been calibrated in a laboratory at Dornier, Friedrichshafen
and has been verified during the six month journey to Mars by means of star
observations. So far no deviations from the calibration have been experienced so
that the IO of the HRSC is considered to be stable.



2.1 Image Matching 

Our matching approach follows a coarse to fine strategy which means the matching 
result is refined step by step through image pyramids. As input data the HRSC im- 
agery, the observed EO and the calibration data of the IO are needed. As an optional 
input it is possible to use a DTM as approximate information. On Mars a high accu- 
racy DTM derived from data of the MOLA instrument is available [10]. 

First, point features are extracted using the Förstner operator [6] and the images
are matched pairwise in all combinations using the cross correlation coefficient as
similarity measure. Each image is divided into subareas to ensure an even distribution
of the tie points over the whole area. To reduce ambiguities and computing time, the
matching location and a search space for the corresponding feature are computed when
transferring a feature from one image to the other. Since no epipolar geometry exists
for line scanner imagery, a feature in one image is transferred to the next image via
equation (1). For the transformation from object space to image space as a function of
the image line (readout cycle) n, an additional condition (2) has to be applied, where x
points in flight direction.

$$x(n) = x\bigl(n, X_0(n), Y_0(n), Z_0(n), \varphi(n), \omega(n), \kappa(n)\bigr) = 0 \qquad (2)$$

This problem can be solved with the well-known Newton method for finding the zero
crossing, where the derivative x'(n) is replaced by the pixel size of the image:

$n_0$ = initial value for the image line

$$n_{i+1} = n_i - \frac{x(n_i)}{\text{pixelsize}}, \qquad i = 0, 1, \ldots$$
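
The iteration can be sketched as follows. This is only an illustration under stated assumptions: `find_readout_cycle`, `toy_x` and `PIXEL_SIZE` are hypothetical names and values, and `toy_x` is a linear stand-in for the real projection behind condition (2), which would evaluate the collinearity equations with the EO of line n.

```python
from typing import Callable


def find_readout_cycle(x_of_n: Callable[[float], float], n0: float,
                       pixel_size: float, tol: float = 1e-3,
                       max_iter: int = 50) -> float:
    """Solve x(n) = 0 by Newton steps with x'(n) approximated by pixel_size."""
    n = n0
    for _ in range(max_iter):
        step = x_of_n(n) / pixel_size   # Newton update with a fixed slope
        n -= step
        if abs(step) < tol:             # converged to a fraction of a line
            break
    return n


# Hypothetical stand-in for condition (2): x changes by one pixel per line.
PIXEL_SIZE = 7e-6  # metres, illustrative value only


def toy_x(n: float) -> float:
    return PIXEL_SIZE * (n - 1234.5)


n_star = find_readout_cycle(toy_x, n0=0.0, pixel_size=PIXEL_SIZE)
```

Because the slope is fixed to the pixel size, each step moves the estimate by the current residual expressed in image lines; for a smooth trajectory a few iterations should suffice.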

After matching all overlapping images pairwise in all combinations, an undirected
graph is generated. The nodes of the graph are the point features, the edges are the
matches between them. This graph is divided into connected components. The next
step is the generation of the point tuples, where a point tuple is characterised by
the property that no more than one feature per image is admissible. The complexity
of this problem can grow exponentially. Instead of using tree search or binary
programming techniques, a RANSAC (Random Sample Consensus) procedure [5] is
applied. The method relies on the fact that the likelihood of hitting a good
configuration (correct tuple) by randomly choosing a set of observations (features of
the subgraph) is large after a certain number of trials. The advantage of this method is
the high probability of obtaining a good point. Including a geometric consistency check,
the method also eliminates blunders [3].
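
A minimal sketch of the graph step and a RANSAC-style tuple selection is given below. It rests on assumptions that are not taken from the paper: features are encoded as `(image_id, point_id)` pairs, a candidate tuple is scored by the number of pairwise matches among its members, and the geometric consistency check is omitted. All function names are hypothetical.

```python
import random
from collections import defaultdict


def connected_components(matches):
    """matches: iterable of (feature_a, feature_b); feature = (image_id, point_id)."""
    adj = defaultdict(set)
    for a, b in matches:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for start in list(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:                      # depth-first traversal
            node = stack.pop()
            comp.add(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        comps.append(comp)
    return comps, adj


def sample_tuple(component, adj, trials=100, rng=None):
    """Randomly draw tuples with one feature per image; keep the best supported."""
    rng = rng or random.Random(0)
    by_image = defaultdict(list)
    for feat in component:
        by_image[feat[0]].append(feat)
    best, best_score = None, -1
    for _ in range(trials):
        cand = [rng.choice(feats) for feats in by_image.values()]
        chosen = set(cand)
        # Assumed consensus score: number of matches inside the tuple.
        score = sum(1 for f in cand for g in adj[f] if g in chosen) // 2
        if score > best_score:
            best, best_score = cand, score
    return sorted(best), best_score


matches = [((0, 0), (1, 0)), ((0, 0), (2, 0)), ((1, 0), (2, 0)), ((0, 0), (1, 1))]
comps, adj = connected_components(matches)
best, score = sample_tuple(comps[0], adj)
```

In the toy data, image 1 contributes two competing features; the sampled tuple that keeps `(1, 0)` is supported by three matches and is therefore preferred over the one containing `(1, 1)`.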

From the start pyramid level (lowest resolution) to the so-called intermediate level
(medium resolution) feature based matching is carried out using the whole images.
Going down the image pyramid, the image size increases, as does the number of
extracted features. Besides the heavily increasing computational time, the matching of
the complete images would result in too many tie points for the camera orientation.
Therefore the matching procedure is carried out only for selected "image chips",
starting below the intermediate pyramid level. This means that tie points are searched
for only in areas where points have been found before due to good texture [13].

To further refine the result Multi Image Least Squares Matching (MILSM) is carried 
out following the approach of Krupnik [9]. In this approach the tie points are matched 
in all images simultaneously. A detailed description of the implemented MILSM can 
be found in [7]. Because it is the most accurate matching technique available it is 
possible to further refine the result of the feature based matching. In our implementa- 
tion we can decide whether to apply MILSM or not for each pyramid level. To save 
computing time it is advisable to carry out MILSM only on the last level, which de- 
notes the original resolution. 

Finally, model points are derived via a forward intersection of the image coordinates
of the tie points. They serve as an approximation for the reduction of the search
space on the next lower pyramid level instead of the MOLA points. A more detailed
description of the application flow can be found in [11].



2.2 Bundle Adjustment Using Control Information 

In the bundle adjustment the concept of orientation images proposed by Hofmann et
al. [8] is used. This approach estimates the parameters of the EO only at a few
selected image lines, the so-called orientation images [12]. The EO for all other image
lines is interpolated from the values at the orientation images. The differences for
each image line can be considered as correction terms that have to be added to the
interpolated values. This solution keeps the number of orientation parameters small
and, more importantly, makes it possible to exploit the good relative accuracy of the
observed orientation parameters. The mathematical model for photogrammetric point
determination with a 3-line camera is based on the well-known collinearity equations
(1).
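
The orientation image idea can be illustrated with a short sketch. It assumes linear interpolation between orientation images, which the text does not specify, and all names (`eo_for_all_lines`, `orient_idx`, `adjusted_at_orient`) are hypothetical. The per-line differences are taken from the observed EO, so the relative accuracy of the observations between the orientation images is preserved.

```python
import numpy as np


def _interp_columns(lines, knot_lines, knot_values):
    """Linearly interpolate each EO parameter (column) at every image line."""
    return np.column_stack([
        np.interp(lines, knot_lines, knot_values[:, k])
        for k in range(knot_values.shape[1])
    ])


def eo_for_all_lines(lines, observed_eo, orient_idx, adjusted_at_orient):
    """EO per line = interpolated adjusted EO + observed per-line differences."""
    lines = np.asarray(lines, dtype=float)
    observed_eo = np.asarray(observed_eo, dtype=float)
    adjusted_at_orient = np.asarray(adjusted_at_orient, dtype=float)
    knot_lines = lines[orient_idx]
    # Differences of the observed EO from its own interpolation between the
    # orientation images; these act as the correction terms per image line.
    differences = observed_eo - _interp_columns(
        lines, knot_lines, observed_eo[orient_idx])
    return _interp_columns(lines, knot_lines, adjusted_at_orient) + differences


# Toy data: six EO parameters per line, three orientation images.
lines = np.arange(0, 101)
observed = np.column_stack([0.5 * lines + np.sin(lines / 7.0) + k for k in range(6)])
orient_idx = [0, 50, 100]
shift = np.array([10.0, -5.0, 2.0, 0.1, -0.2, 0.3])   # illustrative bias
adjusted = observed[orient_idx] + shift
result = eo_for_all_lines(lines, observed, orient_idx, adjusted)
```

With a constant shift estimated at all orientation images, the result equals the observed EO plus that shift on every line, which is the expected behaviour of a pure bias correction.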

The starting point of the discussion about bundle adjustment using a DTM as control
information is an approach presented in [12]. This approach uses a least squares
adjustment with additional conditions to obtain a relation between a DTM and the
bundle adjustment without control information. In the case of Mars it is possible to use
the MOLA DTM as control information. One suitable way is to use the terrain surface
derived from MOLA points and fit the matched HRSC points into the MOLA DTM.
This is advantageous because there are more MOLA points than HRSC points.

At locations where HRSC points are available the MOLA data can be described as
a local surface. The surface is defined either by three original MOLA points or by
four points of a DTM grid, which are interpolated using the original MOLA
measurements. In the first case the local surface is described by three irregularly
spaced MOLA points, which stem from the original MOLA measurements, and d is the
vertical distance of HRSC point H to the plane defined by $M_1$, $M_2$ and $M_3$ (Fig. 1,
left). In the second case, the HRSC points have to lie on a bilinear surface defined by
four neighbouring DTM points, which enclose the HRSC point and form a grid structure.
The distance d is defined as the vertical distance from HRSC point H to the bilinear
surface defined by the four points $M'_1$, $M'_2$, $M'_3$ and $M'_4$ (Fig. 1, right). In the
current implementation the approach using the MOLA DTM has been applied because its
advantages outweigh those of using the raw MOLA points.
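
For the grid case, the vertical distance can be sketched as below. The sketch assumes a square grid cell with its lower-left corner at `(x0, y0)` and side length `cell`, and the sign convention d = Z_H minus surface height; function and parameter names are hypothetical and not taken from the implementation described here.

```python
def bilinear_height(x, y, x0, y0, cell, z00, z10, z01, z11):
    """Height of the bilinear surface spanned by four grid heights.

    z00 is the height at (x0, y0), z10 at (x0 + cell, y0),
    z01 at (x0, y0 + cell) and z11 at (x0 + cell, y0 + cell).
    """
    u = (x - x0) / cell   # normalised position inside the cell, 0..1
    v = (y - y0) / cell
    return ((1 - u) * (1 - v) * z00 + u * (1 - v) * z10
            + (1 - u) * v * z01 + u * v * z11)


def vertical_distance(h, x0, y0, cell, z00, z10, z01, z11):
    """Vertical distance d of HRSC point h = (X, Y, Z) to the bilinear surface."""
    x, y, z = h
    return z - bilinear_height(x, y, x0, y0, cell, z00, z10, z01, z11)


# Toy cell: surface rises from 0 m to 10 m in X; the point sits 2 m above it.
d = vertical_distance((5.0, 5.0, 7.0), 0.0, 0.0, 10.0, 0.0, 10.0, 0.0, 10.0)
```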




[Figure 1 legend: M_1..M_3: MOLA mesh derived from original MOLA points; M'_1..M'_4: MOLA mesh derived from MOLA DTM points; H: HRSC point; d: distance between HRSC point and MOLA DTM surface]

Fig. 1. Left: Structure based on original MOLA points. Right: Regular DTM grid



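The mathematical model of the bundle adjustment is given in equation (4):

$$\begin{aligned}
v_x &= f(X, Y, Z, x_0, y_0, c, X_0, Y_0, Z_0, \varphi, \omega, \kappa) - x_i \\
v_y &= f(X, Y, Z, x_0, y_0, c, X_0, Y_0, Z_0, \varphi, \omega, \kappa) - y_i
\end{aligned} \qquad (4)$$

with:

$$X_0 = X_{B_0} + \hat{X}_0, \quad Y_0 = Y_{B_0} + \hat{Y}_0, \quad Z_0 = Z_{B_0} + \hat{Z}_0, \quad \varphi = \varphi_B + \hat{\varphi}, \quad \omega = \omega_B + \hat{\omega}, \quad \kappa = \kappa_B + \hat{\kappa}$$

where the EO is composed of biases ($X_{B_0}, Y_{B_0}, Z_{B_0}, \varphi_B, \omega_B, \kappa_B$) valid for the entire strip
and terms ($\hat{X}_0, \hat{Y}_0, \hat{Z}_0, \hat{\varphi}, \hat{\omega}, \hat{\kappa}$) valid for a single CCD line only.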

Additionally one observation equation (5) is used for each HRSC point 

$$v_d + d = f(X_H, Y_H, Z_H, X_{M_i}, Y_{M_i}, Z_{M_i}), \qquad i = 1, \ldots, 4 \qquad (5)$$

with three unknowns (X, Y, Z of HRSC tie point), one observation (difference d be- 
tween HRSC point and MOLA surface) and twelve constants (X, Y, Z for all four 
MOLA DTM points) for each surface. The accuracy of the observed difference is 
determined by the accuracy of the MOLA points. 



3 Processing of HRSC Imagery 

In this section, the HRSC imagery used is described first. The results of the
matching and bundle adjustment are then presented and discussed on the basis of
orbits 18 and 68.

3.1 Data

For the evaluation of the matched tie points and the achieved accuracy of the bundle
adjustment, imagery from orbits 18 and 68 has been chosen, which was received in the
early phase of the Mars Express mission. The observations of the EO and the
calibration data of the IO as well as the MOLA DTM are used as input for the
DIM and the bundle adjustment. The a priori accuracy has been introduced into the
bundle adjustment with a value of 1000 m for the position and 28 mgon for the
attitude. The trajectory of the orbiter is considered to be very stable. Additionally, the
HRSC imagery is used for the matching.



Fig. 2. Left: Part of orbit 68 with high texture. Right: Histogram of region with low contrast

The CCD arrays of the HRSC consist of 5176 active pixels each, which yields a swath
width of about 65 km on the surface of Mars. The strips can have a length of up to
300,000 lines, spanning about 4,000 km on the surface. Due to the limited bandwidth
between Mars and Earth, only the nadir channel is able to operate at full resolution.
Generally the resolution of the two stereo channels has to be reduced by a factor of 2
and that of the remaining channels by a factor of 4. To obtain an equivalent scale, the
nadir channel has to be resampled to the resolution of the stereo channels for the
matching. Depending on the covered region on Mars, the imagery shows areas with high
texture and areas with hardly any texture and low contrast (Fig. 2).
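
As a rough plausibility check of these figures, the nominal ground sampling distance can be worked out from the quoted numbers. This is only back-of-the-envelope arithmetic with rounded inputs; the actual ground resolution varies with the imaging altitude, and the variable names are illustrative.

```python
# Quoted values (approximate): swath width, pixels per line, strip length.
swath_width_m = 65_000
active_pixels = 5176
strip_length_m = 4_000_000
strip_lines = 300_000

across_track_gsd = swath_width_m / active_pixels   # about 12.6 m per pixel
along_track_gsd = strip_length_m / strip_lines     # about 13.3 m per line
stereo_gsd = 2 * across_track_gsd                  # stereo channels reduced by 2

print(f"across track: {across_track_gsd:.1f} m")
print(f"along track:  {along_track_gsd:.1f} m")
print(f"stereo:       {stereo_gsd:.1f} m")
```

Both directions give a nominal pixel footprint of roughly 13 m at full resolution, so the reduced stereo channels correspond to about 25 m per pixel.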




3.2 Results 

3.2.1 Results of the Matching 

In a first evaluation the ray intersections of the tie points are analysed. The values of
the EO from ESA have been fixed in the bundle adjustment and no DTM has been
introduced as control information. This can be considered as a forward intersection.
The obtained values are compared to the results calculated by the bundle adjustment
improving $\varphi$ and $\kappa$. This means a constant bias is estimated for both angles along the
entire orbit. Biases for $\varphi$ and $\kappa$ were introduced because only these two parameters
can be improved using tie points.

Table 1. Theoretical standard deviations of the object coordinates

orbit   altitude [km]   σX [m]        σY [m]        σZ [m]
18      275 - 375       11.0 / 5.9    13.0 / 6.6    34.0 / 18.0
68      269 - 505       30.3 / 10.3   26.6 / 10.9   48.8 / 17.8

In Tab. 1 the accuracies of the object coordinates of the ray intersections are shown
for the selected orbits. The left value is the standard deviation of the ray intersections
using the EO from ESA. The right value shows the achieved theoretical standard
deviation of the ray intersections after improving $\varphi$ and $\kappa$. The accuracies of all
computed orbits are in a range of about 6 to 11 m in X and Y, depending on different
imaging altitudes. Z accuracies of all orbits are about 18 to 22 m. The standard
deviations of the ray intersections are improved by a factor of 2 to 3 and a final
accuracy of about 0.4 pixel in X and Y and 0.8 pixel in Z is achieved.

3.2.2 Results of the Bundle Adjustment 

The second part of the results shows the evaluation after HRSC object points have
been fitted to the MOLA DTM. Here, the biases of all six parameters of the EO ($X_0$,
$Y_0$, $Z_0$, $\varphi$, $\omega$, $\kappa$) have been improved along the trajectory. Tab. 2 shows the improved
values and their standard deviations for orbits 18 and 68. In most cases the values can
be determined with high significance, because the standard deviations of the bias
values are lower than the bias values themselves.

The standard deviations of the object coordinates for orbits 18 and 68 are
shown in Tab. 3. They depend on two results. First, there are the accuracies of the
ray intersection (Tab. 1), which determine the accuracies within the orbit itself. Second,
there are the accuracies of the absolute orientation between orbit and MOLA DTM
(Tab. 2). Thus, the precision of the point determination is a combination of these two
accuracies. The standard deviations of the object points in all three dimensions are
less than 20 m (Tab. 3).



Table 2. Theoretical standard deviations of orbit determination

orbit                 X_0 [m]   Y_0 [m]   Z_0 [m]   φ [mgon]   ω [mgon]   κ [mgon]
18      bias value      90.4     -64.6     -38.2     -51.1      -64.4       -6.2
        bias σ           7.3      11.0       1.6       0.3        1.5        0.1
68      bias value     -12.1    -112.3     -41.2     -24.9      -12.1      -35.9
        bias σ          10.7      16.7       6.7       0.4        1.9        0.6

Table 3. Theoretical standard deviations of HRSC points fitted to MOLA DTM

orbit   σX [m]   σY [m]   σZ [m]
18       9.1     10.6     17.0
68      14.4     16.7     17.5

Finally, the root-mean-square (RMS) Z differences between object coordinates of the
HRSC tie points and the MOLA DTM were investigated, once without DTM control
information and once with it. Without DTM control information the RMS Z differences
between DTM and HRSC object points are in the range of 200 m (orbit 18: 177 m,
orbit 68: 200 m). After the bundle adjustment including DTM control information the
RMS Z differences decrease by a factor of two to three (orbit 18: 84 m, orbit 68: 63 m).
Therefore, the adaptation of HRSC data to the MOLA reference system has succeeded.



4 Conclusion 

The results show the efficiency of the image matching and bundle adjustment
approaches to achieve an improved exterior orientation with the MOLA DTM as control
information. The tie points are distributed evenly over the whole block with a good
rate of 3-fold points. An accuracy of 0.4 pixel in position and 0.8 pixel in height is
achieved. The accuracy of the position of the exterior orientation improves
significantly, from an a priori value of 1000 m to less than 20 m in all three
dimensions (Tab. 2). The accuracy of the attitude improves from 28 mgon to 1-2 mgon in
all angles. The position and attitude could thus be improved by an average factor of 30
to 50. After the bundle adjustment the object coordinates of the tie points therefore
have a very high accuracy. Finally, there is a high consistency between HRSC points and
the MOLA DTM, which constitutes the valid reference system on Mars.


Abstract. The automatic acquisition of structured object maps requires sophisticated
perceptual mechanisms that enable the robot to recognize the objects that
are to be stored in the robot map. This paper investigates a particular object
recognition problem: the automatic detection and classification of gateways in office
environments based on laser range data. We propose, discuss, and empirically
evaluate a sensor model for crossing gateways and different approaches
to gateway classification, including simple maximum classifiers and HMM-based
classification of observation sequences.



1 Introduction 

So far robot maps primarily support safe and efficient navigation [2,7], see [11] for an 
extended overview of state-of-the-art mapping approaches. The next generation of maps 
will in addition provide better support for the achievement of service tasks. They will do 
so by explicitly representing the environment structure and by modeling relevant objects 
of the environment. 

In our previous research, we have proposed Region & Gateway Maps (RG Maps)
as resources for autonomous mobile robots acting in structured human indoor
environments [4]. RG maps are tuples (R, G), where R denotes a set of regions and G is a
set of gateways that represent the possible transitions between regions. A region has a
compact geometric description, a bounding box, a list of adjacent gateways, and a set of
models that represent the task-relevant objects within the region. The second key
component of RG maps are gateways, prominent and recognizable areas that connect
different parts of the robot's environment. The recognition of gateways allows robots to
autonomously extract the environment structure and represent it in the map [8,3,1].

In two companion papers we have detailed our mechanisms for acquiring compact 
geometric descriptions of regions [4] and for the acquisition of models of rectangular 
task relevant objects [10]. This paper addresses the problem of automatically detecting 
and classifying crossing gateways. 

Gateways form perceptually recognizable, characteristic transitions between two or
more adjacent regions. They can be traversed in any direction and are the only way
to pass from one region into another. The partitioning of floor plans is based on
gateways such as cross-ways, junctions, turns and narrow passages (see also Fig. 1). In our
approach gateways are specified by a class label, adjacent regions, traversal directions,
crossing points and gateway points that can be used for detecting when a gateway is
entered and left. The set of discrete gateway points is derived from features extracted
from a single laser scan (see section 2). Pairs of these gateway points form passages
that the robot can pass through. Narrow passages or open-close transitions are
characterized by a single pair of gateway points, whereas multi-passage gateways such as
junctions contain several passages. It is also possible to combine multiple gateway
structures as encountered in office environments (refer to Fig. 1). We will focus here on
crossing gateways, i.e. gateways which connect hallway regions. The detailed concepts
and properties of such gateways can be found in [4].

The computational problem of gateway recognition and classification can be
formulated as follows: Given a single scan or a sequence of scans provided by a laser
range finder and a set of gateway models, the robot autonomously detects and classifies
crossing gateways. We solve the gateway recognition problem in a computational
process that executes a sequence of three steps: (1) generating hypotheses for virtual
line models (VLMs) (sec. 2), (2) determining weights according to general and specific
gateway models (sec. 3), (3) using the generated observation vector for classification
(sec. 4). Finally, we empirically evaluate the proposed methods (sec. 5) and conclude.




Fig. 1. Left: Classes of Gateways - Right/Left Turn (1,2), X-Crossing (3), T-Junction/Forking
(4,5), Narrow Passage (6), Right/Left Opening (7,8), Combination of Gateways (9); (• gateway
point, ○ crossing point, ← traversal direction, ... region border); Right: Example environment -
letters denote gateways, numbers denote regions



2 Generating Hypotheses for Virtual Line Models (VLMs) 

In order to represent gateway hypotheses we propose virtual line models (VLMs) as an
appropriate feature language. VLMs are based on the assumptions that environments are
rectangular and hallways have approximately the same width. The VLM consists of a
left, right and front virtual line as well as a hidden virtual line (Fig. 2). To generate VLM
hypotheses, we first extract low-level features, i.e., virtual lines and depth singularities,
from the line segment scan and the point scan, respectively. In the next step, those
virtual lines are grouped to form hypotheses with respect to the VLM in Fig. 2. In the
first processing step the algorithm generates a line segment scan $L_{ps}$ from the point
scan $L_p$ by means of linear regression according to [6].

Virtual Lines. Line segments from $L_{ps}$ which lie approximately on the same line are
grouped and represented by that line, also referred to as a virtual line (see Fig. 2).

Depth Singularities. This point feature is extracted from $L_p$ and denotes discontinuities
in the distance measurements of a laser scan, see Fig. 2. The parameter $\Delta d_{min}$ indicates
the minimum difference between two succeeding distance measurements that constitutes
a depth singularity. $P_i \in L_p$ is the point where the distance measurement $d_i$ ends. $P_{ds}$
are the points at the depth singularities.

$$P_{ds} = \{ P_i : (d_i < d_{i \pm 1}) \wedge (|d_i - d_{i \pm 1}| \geq \Delta d_{min}) \}$$
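
A minimal sketch of the depth singularity test is given below. It assumes that a scan is a plain list of distances ordered by beam angle and reads the criterion as: a reading is a depth singularity when it is the nearer side of a jump of at least $\Delta d_{min}$ to one of its neighbours. The function name and the sample values are illustrative only.

```python
def depth_singularities(distances, delta_d_min):
    """Return the indices of readings that lie at a depth singularity."""
    indices = []
    for i, d in enumerate(distances):
        for j in (i - 1, i + 1):                 # previous and next reading
            if 0 <= j < len(distances):
                neighbour = distances[j]
                # Nearer reading of a sufficiently large distance jump.
                if d < neighbour and abs(d - neighbour) >= delta_d_min:
                    indices.append(i)
                    break
    return indices


# Toy scan in metres: a wall at about 2 m with an opening seen at about 5 m.
scan = [2.0, 2.0, 2.1, 5.0, 5.1, 5.0, 2.2, 2.1]
singular = depth_singularities(scan, delta_d_min=1.0)
```

In the toy scan the two readings bordering the opening are reported, which matches the intuition that depth singularities mark the corners where a hallway opens.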



[Figure 2 labels: left virtual line, hidden virtual line, front virtual line, right virtual line, depth singularities]

Fig. 2. Point scan of an X-Crossing (left); Virtual Line Model for Crossing Gateways, i.e.
X-Crossing, L/R-Turn, T-Junction (middle); Depth singularities in a laser scan (right)

Virtual Line Grouping. Based on the virtual lines and depth singularities we generate
hypotheses for VLMs, which signal that a crossing gateway of some kind may be present.
To generate candidates for the virtual left and right line, we search for pairs of parallel
virtual lines with the robot in between. Virtual front lines intersect a pair of
parallels at approximately a right angle and in front of the robot. Finally, we estimate
the virtual hidden lines. For this purpose, we consider depth singularities that are close
to the virtual left or right line. The hidden line is constructed such that it is parallel to
the virtual front line and passes through the given depth singularity. To deal with
situations where no valid depth singularities are present, we add hypotheses where the
estimation of the hidden line is solely based on the environment assumptions. As a result
we obtain a set of annotated virtual line quadruples, which represent hypotheses for
VLMs. The gateway points are defined by the intersections of those virtual lines.



3 Evaluating the VLM Hypotheses 

We evaluate gateway hypotheses by assessing the similarity of a perceived VLM and a
specific gateway class. To this end, we propose the following measures:

1. rectangularity and distance measure to reflect the general model quality and

2. freespace measure to account for the match with a specific gateway class.

As a result we obtain an observation vector for each VLM hypothesis. Additionally, 
we track VLM hypotheses over consecutive measurements while the robot is moving 
towards the gateway to generate observation sequences. 

Distance Measure. The expected hallway width $d_{hw}$ has been manually measured.
Deviations from this value are weighted according to:

$$W^i_{distance} = 1 - \frac{|d_i - d_{hw}|}{d_{hw}}, \qquad W_d = \frac{1}{4} \sum_{i=1}^{4} W^i_{distance}$$

where $d_i$ is the Euclidean distance between two neighboring gateway points and $W_d$
denotes the averaged distance weight.



Rectangularity Measure. The rectangularity criterion refers to the inner angles $\alpha_i$
(i = 1...4) of the convex quadrangle given by the VLM. We define the rectangularity
by the deviation of the inner angles $\alpha_i$ from $\frac{\pi}{2}$:

$$W_r = 1 - \frac{\sum_{i=1}^{4} \left| \alpha_i - \frac{\pi}{2} \right|}{2\pi}$$
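
The rectangularity measure, read here as $W_r = 1 - \sum_i |\alpha_i - \pi/2| / (2\pi)$, can be sketched as follows. The sketch assumes the four gateway points are given in order around a convex quadrangle; the function names are hypothetical.

```python
import math


def inner_angles(quad):
    """Inner angles (radians) of a convex quadrangle given as ordered (x, y) points."""
    angles = []
    n = len(quad)
    for i in range(n):
        px, py = quad[i]
        ax, ay = quad[i - 1]              # previous corner
        bx, by = quad[(i + 1) % n]        # next corner
        v1 = (ax - px, ay - py)
        v2 = (bx - px, by - py)
        cos_angle = ((v1[0] * v2[0] + v1[1] * v2[1])
                     / (math.hypot(*v1) * math.hypot(*v2)))
        angles.append(math.acos(max(-1.0, min(1.0, cos_angle))))
    return angles


def rectangularity(quad):
    """W_r = 1 - sum(|alpha_i - pi/2|) / (2 * pi)."""
    deviation = sum(abs(a - math.pi / 2) for a in inner_angles(quad))
    return 1.0 - deviation / (2 * math.pi)


square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
rhombus = [(0.0, 0.0), (2.0, 0.0), (3.0, math.sqrt(3.0)), (1.0, math.sqrt(3.0))]
w_square = rectangularity(square)     # all angles 90 degrees
w_rhombus = rectangularity(rhombus)   # angles 60/120/60/120 degrees
```

A perfect rectangle yields the maximum weight of 1, while the rhombus with 60 and 120 degree corners drops to 2/3, so skewed hypotheses are penalised gradually rather than rejected outright.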



Freespace Measure. Considering the VLM as depicted in Fig. 2 (middle), we define
three pairs of gateway points, namely on the virtual left, right and front line. According
to those pairs of gateway points, we divide the sensor data into three sectors (Fig. 3).
Each sector $S_i$ comprises $N_{S_i}$ measurements. Based on those definitions we propose
the freespace measure (FSM) as a quantity for the match of a hypothesis to the sensor
data. In each of the three sectors the sensor measurements should either be close to a
given line (On Line FSM) or should cross a given line (Over Line FSM). The gateway
class determines which of the two FSM variants applies to a certain sector. For example,
for an L-Turn the measurements in the front sector are expected to match the
virtual front line (On Line FSM), whereas for an X-Crossing, measurements in the same
sector are expected to cross the virtual front line (Over Line FSM), see Fig. 3.

On Line FSM. $P_i$ denotes a laser measurement from the point scan $L_p$ and $d(P_i)$ is the
respective distance measurement. We compute a point $P_i^{vl}$ on the considered virtual line
and its distance to the robot $d(P_i^{vl})$, where $P_i$ and $P_i^{vl}$ lie on the same ray from the
robot. Then we count all measurements for which the difference of $d(P_i)$ and $d(P_i^{vl})$
lies between a given lower and upper threshold. Finally, we normalize this on-line count
($C_{ol}$) by the overall number of measurements in the sector:

$$W^i_{on\,line} = \frac{C^i_{ol}}{N_{S_i}}$$
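
The On Line FSM for one sector can be sketched as follows. The sketch assumes that the expected distances to the virtual line along the same rays have already been computed; `on_line_fsm`, its parameters and the threshold values are hypothetical.

```python
def on_line_fsm(measured, expected, lower, upper):
    """Fraction of sector measurements that lie close to the virtual line.

    measured: distances d(P_i) of the laser readings in the sector
    expected: distances to the virtual line along the same rays
    lower, upper: thresholds on the difference measured - expected
    """
    if not measured:
        return 0.0
    on_line_count = sum(
        1 for d, e in zip(measured, expected) if lower <= d - e <= upper
    )
    return on_line_count / len(measured)


# Toy sector: three readings hit the wall at about 2 m, one passes beyond it.
measured = [2.0, 2.05, 3.0, 1.98]
expected = [2.0, 2.0, 2.0, 2.0]
weight = on_line_fsm(measured, expected, lower=-0.1, upper=0.1)
```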



Over Line FSM. This measure only applies to the front sector. We construct a line $l_{par}$
parallel to the virtual front line and set back by a given distance. Then we count all
measurements which intersect $l_{par}$, and normalize the resulting over-line count. Analogous
to the On Line FSM we get $W_{over\,line}$, see also Fig. 3.




Fig. 3. From Left to Right: Freespace measure (FSM) for different cases - On Line FSM (1,3),
Over Line FSM (2), (• gateway point, - laser scan measurement, - - line for free space evaluation);
FSM configurations for XCrossing (4) and LTurn (5)

Generating Gateway Weights and Observation Sequences. Utilizing the proposed
measures, we define weights for each VLM hypothesis with regard to a certain
gateway class GW:

$$W(GW, VLM) = f_{vlm} \cdot \bigl(W_d(VLM) + W_r(VLM)\bigr) + f_{gw} \cdot \sum_{i=1}^{3} W^i_{FSM}(GW)$$

where $f_{vlm}$ and $f_{gw}$ denote weighting factors for the general and gateway-specific
measures, respectively. As a result we obtain an observation vector for each VLM
hypothesis, where the entries quantify the similarity of the hypothesis to a specific
gateway class. In most practical cases the mobile platform approaches the gateway area.
Thus, we observe the same VLM hypotheses from different positions, where the distance
to the gateway is continuously decreasing. The VLM hypotheses tracking is based on the
gateway points and Euclidean distances. If all gateway points of two VLM hypotheses
have an approximate match, they are considered to be identical. Based on this tracking,
we obtain sequences of observation vectors. A sequence starts when the hypothesis is
first observed and the distance falls below a threshold. It is finished or corrupted when
it is either lost or the robot enters the gateway.

4 Classification 

We now investigate the computational task of classifying the obtained observation se- 
quences with regard to the introduced gateway classes, by means of the following clas- 
sification methods: based on the observation vector closest to the gateway, weighted 
average over all observation vectors in a sequence and Hidden Markov Models. 

1. Single Observation and Averaged Sequence Based Classification 

Observations close to a gateway imply a more complete coverage of the gateway area
by the sensors, hence they are in general the most informative. The single observation
classifier (SOC) considers the maximum weight to determine the gateway at hand.
This approach demonstrates the discrimination power of the freespace measurements
and the resulting weights. It is, however, very sensitive to sensor noise, occlusions and
dynamic changes in the environment. A simple alternative is the fusion of consecutive
measurements by calculating a weighted average over the observation sequence, where
the weights are inversely proportional to the distance. Afterwards, SOC is used to
decide which specific gateway is present. Although this approach considers the complete
observation sequence, it does not fully exploit probabilistic properties of observations
and temporal relations between them.
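
Both simple classifiers can be sketched in a few lines. The sketch assumes that an observation vector is a mapping from gateway class to weight and that a sequence is a list of (distance, observation) pairs; the class names follow the text, while the function names and the toy weights are hypothetical.

```python
def soc(observation):
    """Single observation classifier: class with the maximum weight."""
    return max(observation, key=observation.get)


def averaged_classifier(sequence):
    """Distance-weighted average over a sequence, followed by SOC.

    sequence: list of (distance_to_gateway, {class_name: weight}) pairs;
    each observation is weighted by 1 / distance.
    """
    totals, norm = {}, 0.0
    for distance, observation in sequence:
        w = 1.0 / distance
        norm += w
        for cls, value in observation.items():
            totals[cls] = totals.get(cls, 0.0) + w * value
    averaged = {cls: total / norm for cls, total in totals.items()}
    return soc(averaged), averaged


# Toy sequence approaching a gateway (distances in metres, weights invented).
sequence = [
    (6.0, {"XCrossing": 0.9, "TCrossing": 0.5}),
    (3.0, {"XCrossing": 0.6, "TCrossing": 0.7}),
    (1.5, {"XCrossing": 0.5, "TCrossing": 0.8}),
]
label, averaged = averaged_classifier(sequence)
closest_label = soc(sequence[-1][1])
```

In the toy sequence the far observation favours XCrossing, but the closer observations dominate the average, so both classifiers agree on TCrossing.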

2. Classification Based on Hidden Markov Models (HMMs) 

A more promising approach to gateway classification is the use of HMMs. They provide 
mechanisms to model temporal structures in sequences, by the use of probabilistic ob- 
servation and state transition models. A detailed description of the theory can be found 
in [9]. In the next paragraphs we briefly outline the steps necessary to use HMMs in the 
context of gateway detection based on the introduced sensor model. 

Clustering the Data and Initializing the HMM. Since our sensor model provides
continuous measurements we use HMMs with continuous outputs. To deal with the
resulting complexity of such HMMs we explicitly cluster the data using the k-means
algorithm [5]. The clusters are then used to build the observation model (mean and
covariance matrices) and to define the structure of the HMM. Since the coverage of the
gateway area by the sensor differs for different positions, we expect the data to form
clusters for different distance intervals, and we compute the start values for the k-means
algorithm accordingly. This assumption is supported by the fact that the mean values are
only altered slightly by the k-means clustering. To gain further intuition we labeled each
observation vector according to distance intervals: $I_1$, $I_2$, $I_3$. After the clustering we
sorted the clusters according to the intervals, and counted how much of the prelabeled
data has been assigned to which cluster. In Table 1 it can be seen that all of the clusters
contain a reasonable number of samples (over all sum), and that clusters are built
according to distance intervals (max/min distance). They contain either observations from
disjoint or slightly overlapping distance intervals or represent different distributions
for the same interval. Those findings are very important for the choice of the HMM
structure and initialization, but they also allow for interpretation of the learned model.
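
The idea of seeding k-means from distance intervals can be sketched as follows. The text does not state how the start values are computed; the sketch assumes equal-width distance bins whose mean observation vectors serve as initial centers, followed by plain Lloyd iterations. The function name and the toy data are hypothetical.

```python
import numpy as np


def kmeans_interval_init(obs, dist, k, iters=50):
    """k-means whose start values are the mean observations of k distance bins."""
    obs = np.asarray(obs, dtype=float)
    dist = np.asarray(dist, dtype=float)
    # Assumed seeding: equal-width bins over the observed distance range.
    edges = np.linspace(dist.min(), dist.max(), k + 1)
    bins = np.digitize(dist, edges[1:-1])          # bin index in 0..k-1
    centers = np.array([
        obs[bins == b].mean(axis=0) if np.any(bins == b) else obs.mean(axis=0)
        for b in range(k)
    ])
    for _ in range(iters):
        # Assign every observation to its nearest center.
        d2 = ((obs[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster runs empty.
        new_centers = np.array([
            obs[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels


# Toy data: one-dimensional observation vectors, far readings low, near readings high.
obs = [[0.2], [0.25], [0.15], [0.8], [0.85], [0.75]]
dist = [6.0, 5.5, 5.0, 2.0, 1.5, 1.0]
centers, labels = kmeans_interval_init(obs, dist, k=2)
```

If the distance intervals really separate the data, the centers barely move during the iterations, which is the behaviour the text reports as support for the assumption.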



Table 1. Clustering for T-Crossing data; columns refer to different clusters, rows depict cluster
properties; all distance measurements in millimeters

cluster id         1        2        3        4        5        6        7        8
I_1 = 7m..4m       475      508      685      795      2        1        0        0
I_2 = 4m..2m       0        0        0        16       277      252      1143     44
I_3 = 2m...        0        0        0        0        0        0        32       437
over all sum       475      508      685      811      279      253      1175     481
mean distance      6112     5796.4   5770.2   4455     3796.8   3538.3   2706.2   1755.5
max distance       6872.2   6956.1   6976.3   4968.7   4009.4   4103.6   3591.7   2310.3
min distance       4854.5   4960.8   4878     3954.8   3438.6   3037.8   1775.9   1427.9



To initialize the HMM, all clusters $C_k$ that cover approximately the same distance
interval are assigned to the same HMM state $S_j$. More precisely, the covariance matrix
and the mean of each $C_k$ add a dimension to the observation model of $S_j$. Considering
Table 1, we obtain an HMM with five states, where [$C_1$, $C_2$, $C_3$] represent the first
state, $C_4$ the second, [$C_5$, $C_6$] the third, and $C_7$ and $C_8$ the fourth and fifth,
respectively. The mixture matrix $M_{mix}$ is initialized uniformly; its dimension is given by
the number of states Q and the maximum number of mixture components M. The states are
arranged to form a left-right HMM and, according to the sequences, left refers to large
and right to small distances. Although in a left-right HMM all entries below the diagonal
of the transition matrix T are zero, we initialized the full matrix with 1/Q. Fig. 4 (left)
shows a left-right model, where the arrows denote the possible transitions from state $S_i$
to $S_j$ with probability $p_{ij}$.
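
The initialization described here can be sketched as follows, using the cluster-to-state assignment given for Table 1. Filling the mixture matrix uniformly with 1/M for every state, including states with fewer than M clusters, is a simplifying assumption; the names `init_hmm` and `is_left_right` are hypothetical.

```python
import numpy as np

# Cluster-to-state assignment as described for Table 1 (cluster ids 1..8).
STATE_CLUSTERS = [[1, 2, 3], [4], [5, 6], [7], [8]]


def init_hmm(state_clusters):
    """Return the initial transition matrix T and mixture matrix M_mix."""
    q = len(state_clusters)                        # number of states Q
    m = max(len(c) for c in state_clusters)        # max. mixture components M
    transition = np.full((q, q), 1.0 / q)          # full matrix filled with 1/Q
    mixture = np.full((q, m), 1.0 / m)             # uniform mixture weights
    return transition, mixture


def is_left_right(transition, tol=1e-12):
    """True if all entries below the diagonal are (numerically) zero."""
    return bool(np.all(np.abs(np.tril(transition, k=-1)) <= tol))


T0, M0 = init_hmm(STATE_CLUSTERS)
```

The helper `is_left_right` can be applied to the learned transition matrix to check whether training has driven the backward transitions to zero, since the initial matrix deliberately does not enforce this structure.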

Learning and Evaluation of the Hidden Markov Model. We fix the observation model
obtained from clustering and use expectation-maximization (EM) learning to determine
appropriate values for $T$, $M_{mix}$ and the state prior, according to [9]. Since the observation
space given by our sensor model is filled very sparsely and the covariances of the data are
all very small, we encountered overfitting problems: the observation probabilities tend to
zero and cause numerical instabilities. To avoid these problems, we add noise to the
clustering data to artificially spread the distributions. Finding the best HMM then means
optimizing the learning with regard to the number of clusters and the amount of noise to be
added. Too many clusters leave some clusters with an insufficient number of samples, and
too few reduce the discrimination power. Likewise, too much noise reduces the
discrimination power but increases the generality of the model, whereas no or little noise
results in over-selective HMMs. At present we search for an optimal solution
semi-automatically. Fig. 4 (right) shows the graph of the HMM and its transition matrix
that were learned for the case presented in Table 1. As expected, we obtained a left-right
model (no backward transitions), and most states are only connected with the next state.

Fig. 4. Graph of left-right HMM (left) and learned HMM for the case presented in Table 1 (right).
Also given is the transition matrix $T$, presenting the respective transition probabilities $p_{ij}$:

    0.2    0      0      0
    0.58   0.417  0.003  0
    0      0.37   0.63   0
    0      0      0.68   0.32
    0      0      0      1
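Whether a learned transition matrix is indeed a left-right model can be checked mechanically. The following Python sketch assumes the matrix is given as a list of rows; the helper names and the three-state example matrix are hypothetical and are not the values from Fig. 4.

```python
def is_left_right(transition, tol=1e-9):
    """True if there are no backward transitions, i.e. all entries below the diagonal vanish."""
    return all(abs(transition[i][j]) <= tol
               for i in range(len(transition))
               for j in range(i))


def is_row_stochastic(transition, tol=1e-6):
    """True if every row is a probability distribution over the successor states."""
    return all(abs(sum(row) - 1.0) <= tol for row in transition)


# Hypothetical three-state example (not the matrix of Fig. 4).
T_example = [[0.7, 0.3, 0.0],
             [0.0, 0.6, 0.4],
             [0.0, 0.0, 1.0]]
```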

5 Experimental Results 

In this section, we empirically evaluate the proposed approaches. To acquire a sufficient
amount of data for different hallway environments we used a simulator (RHINO
Navigation Software, also applied in [2]) which provides laser measurements based on
the sensor model of the real SICK LMS200 laser range finder. We also annotated the
maps in order to label the recorded observation sequences automatically. As a result we
obtained about 200000 observation vectors for eight different environments, which adds
up to approximately 20000 observation sequences (divided into training and test data). The
environments differ in the amount of clutter that is present and in the width of the hallways
(2, 2.5 and 3 m). The environment depicted in Fig. 1 is referred to as 2m uncluttered.
For images of all environments refer to our homepage.

It can be seen from Tab. 2 that in some cases the recognition rate of the single
observation classifier (SOC) is very high, but for L/RTurn in particular it is rather poor.
This is because the last observation is not necessarily the best one, e.g. when the robot
is cutting the corner in a left or right turn. The classifier "last but one" in Tab. 2 works
like the SOC but considers the observation before the last. It improves the classification
for some classes, but for others, like TCrossing, it slightly degrades it. This means that
it is difficult to determine which single observation should be used for the SOC.
Furthermore, the classification is slightly worse when clutter is present, due to ambiguous
measurements. For the averaged sequence classifier (ASC) the classification depends
strongly on the weighting function: the more we rely on the closer measurements, the
better. While the classification results are comparable to those of the SOC, we gain a little
more robustness due to the averaging. The classification results can be improved when
the weights for Hallway/Deadend are ignored, but then it is difficult to evaluate
ambiguous situations.
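The influence of the weighting function on an averaged classifier can be illustrated with a small Python sketch. It is not the ASC of this paper: the per-observation scores, the class names used as dictionary keys and the linear weighting `linear_weights` are hypothetical, chosen only to show how emphasising the later (closer) observations can change the decision.

```python
def averaged_sequence_classify(scores, weights):
    """scores: one dict {class_name: score} per observation; weights: one weight per observation."""
    total = sum(weights)
    classes = scores[0].keys()
    averaged = {c: sum(w * s[c] for w, s in zip(weights, scores)) / total for c in classes}
    return max(averaged, key=averaged.get), averaged


def linear_weights(n):
    """Hypothetical weighting: later observations, taken closer to the gateway, count more."""
    return [float(i + 1) for i in range(n)]


# Hypothetical scores of three successive observations while approaching a gateway.
scores = [{"LTurn": 0.1, "TCrossing": 0.9},
          {"LTurn": 0.3, "TCrossing": 0.7},
          {"LTurn": 0.9, "TCrossing": 0.1}]
label, averaged = averaged_sequence_classify(scores, linear_weights(3))
uniform_label, _ = averaged_sequence_classify(scores, [1.0, 1.0, 1.0])
```

With uniform weights the early, distant observations dominate and the sequence is assigned to TCrossing, whereas the linear weighting assigns it to LTurn.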






Table 2. Classification results for the SOC (last obs and last but one) and the averaged sequence
classifier (ASC) (averaged seq); DeadEnds have 100% recognition rate for all cases.

Gateway class     XCrossing   TCrossing   LTCrossing   RTCrossing   LTurn    RTurn
2m uncluttered    581         390         200          198          99       106
  averaged seq    86.8%       96.9%       98%          99.5%        14.1%    41.5%
  last obs        86%         100%        95.5%        97.5%        50.5%    100%
  last but one    86.8%       100%        99%          99.5%        100%     100%
2m cluttered      148         169         70           42           30       38
  averaged seq    83.7%       39.6%       90%          61.9%        6.7%     2.6%
  last obs        95.6%       93.5%       77.2%        57.2%        40%      94.7%
  last but one    98.2%       90.5%       97.2%        59.5%        46.7%    97.4%


Table 3. HMM based classification; given data of two gateways to the two respective HMMs.

environment        2m uncluttered   2m cluttered   2.5m cluttered   3m uncluttered
lturn/rturn        100/100%         100/100%       96.3/100%        100/100%
lturn/tcrossing    100/100%         100/93.5%      98.75/93.5%      100/100%
rturn/tcrossing    100/100%         100/92.9%      100/23%          100/100%
lturn/xcrossing    100/100%         100/100%       99.4/100%        100/100%


The EM learning converged to left-right HMMs with the expected a priori probabilities
for all types of sequences and for training data from one or more environments. When
we train the HMM for a single environment only, the classification rate is 100 percent
for the respective test data, which shows the validity of the HMM approach. Since it is
difficult to determine the optimal HMM for a certain gateway type and data from different
environments, we have not yet obtained an optimal set of HMMs that handles all classes with
satisfactory discrimination power in the general case. However, we give examples for pairwise
classification in Tab. 3. The experiments were performed on the same test data used
for the SOC/ASC evaluation. It can be seen that the hallway width does not influence
the discrimination power, but, as for the SOC/ASC, the recognition rate decreases in the
presence of clutter. Apart from the difficulty of finding the optimal HMMs, the presented
approach seems very promising in the context of the automatic generation of
structured robot maps. Its advantage is that the resulting HMM based classifier provides
probabilities for observation sequences with regard to the different gateways. Thus, it is
possible to globally fuse the results of different observation sequences in a formal
way by means of a Bayes filter.
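How sequence likelihoods from the class HMMs could be fused recursively is sketched below in Python. This is a generic discrete Bayes update and not the filter of this work: the function name `bayes_update`, the uniform prior and the log-likelihood values are assumptions made for illustration, and every class is assumed to have a non-zero prior.

```python
import math


def bayes_update(prior, log_likelihoods):
    """prior: {class: P(class)}; log_likelihoods: {class: log P(sequence | HMM of that class)}."""
    # Work in the log domain and subtract the maximum for numerical stability.
    log_post = {c: math.log(prior[c]) + log_likelihoods[c] for c in prior}
    offset = max(log_post.values())
    unnormalised = {c: math.exp(v - offset) for c, v in log_post.items()}
    norm = sum(unnormalised.values())
    return {c: v / norm for c, v in unnormalised.items()}


# Hypothetical log-likelihoods of two observation sequences under two gateway HMMs.
belief = {"LTurn": 0.5, "TCrossing": 0.5}
for sequence_loglik in ({"LTurn": -40.0, "TCrossing": -42.0},
                        {"LTurn": -35.0, "TCrossing": -36.0}):
    belief = bayes_update(belief, sequence_loglik)
```

Each sequence shifts the log-odds by the difference of its log-likelihoods, so evidence from several sequences accumulates additively.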



6 Conclusion 

In this paper we proposed a sensor model for the detection and classification of different
classes of crossing gateways. The model is based on the virtual line model (VLM) and
different general and gateway specific measures that enable us to assess the similarity
of the perceived sensor data and the different gateway classes. As a result we obtained
observation sequences for the situations in which the robot is approaching gateway areas.
We investigated the properties of that data and showed that it is a discriminating feature
language well suited for the given task. We proposed three classifiers based on the
generated observation sequences. The simple classifiers perform well for certain classes in
uncluttered and static environments, but the tuning of some parameters, like the weighting
function, is neither trivial nor very general. Also, it is difficult to handle exclusion classes like
Hallway/DeadEnd without decreasing the performance. On the other hand, we presented
theory and experiments for the HMM based sequence classification. It could be seen
that the approach is very promising with regard to global fusion of observations and
reasoning under uncertainty, but learning appropriate models is a challenging task.

The next step is to learn the set of HMMs for all classes of gateways, so as to
maximize the discrimination power across the set of HMMs, and also the tolerance to
changes in the environment.


Abstract. A pipelined, parallel high-speed image processor implemented in an
FPGA for applications in close range photogrammetry is described. The bottleneck
of high-speed photogrammetry is the accurate sub-pixel measurement of retro-
reflective targets. We use an enhanced Sobel edge detector and a special filling
algorithm to compute the segmentation of the targets. The segmented regions are
labeled and the weighted center of gravity is computed for each region.

The incoming image data is processed in real time. To achieve a high throughput,
the pixel based processing is done for ten image columns simultaneously. A total
throughput of over 660 million pixels per second has been demonstrated with a
design clock of only 66 MHz.

An automotive application of the image processor is presented that measures the
3-d wheel position of a moving car.



1 Introduction 

Close range photogrammetry is widely used for precision measurements in static scenes.
In the generic case a single camera is used to take many pictures of the measurement
object. In an offline process the target positions are extracted from the images and
the 3-d co-ordinates of the targets as well as the camera locations and calibration are
computed. For time-resolved short-term measurements of moving objects, film and video
cameras with various frame rates are used. The image processing is still done offline.
The measurement time is typically limited to a few seconds due to limits in memory size
or film length.

1.1 Motivation 

There are many interesting motion processes that require both high-speed image
acquisition and long-term measurement. High-speed recording of images is technically possible
but expensive, and the evaluation of the recorded images is usually very time consuming.
This was the reason to create a system that can extract the important information from
photogrammetric images in real time. The photogrammetric evaluation relies mainly on
the coordinates of the target centers and the target size. Since this information needs less
than 150 bits per target, the data reduction factor exceeds 600. The image coordinates can
be stored very compactly or transferred to a computer to be processed by photogrammetry
software that computes 3-d information in real time.

Fig. 1. a) An image taken by the high-speed camera mounted on a car. b) The high-speed
photogrammetry system consisting of a camera, a high-energy LED flash and an integrated FPGA
board (not visible). The image coordinates are transferred over a TCP/IP connection. The housing
has a width of approximately 35 cm.



1.2 Premises 

Figure 1(a) shows a typical image of a photogrammetric measurement.
The targets have a high contrast and the background is mainly black. This is a result of
using a ring light source around the objective and retro-reflective material for the targets.
Because of these well-chosen conditions, a grey level threshold is often sufficient to
separate the targets from the background. More robust than a grey level threshold is an
edge based segmentation, which is used here.

The challenge is to perform the image measurement robustly and in real time. The camera
we use has a resolution of 1280 x 1024 pixels and a frame rate of up to 485 Hz. The
resulting data rate is 660 million pixels per second. We developed a stand-alone board that
consists chiefly of a CameraLink interface for the connection to the camera, a Virtex-II-Pro
XC2VP30 FPGA, a micro-controller and an Ethernet interface for the connection to
the host computer. The following sections describe the algorithm and the implementation
of the image processing on this FPGA.



2 Algorithm 

The computation of the image co-ordinates of photogrammetric targets is usually based
on template matching or on the weighted centre of gravity (WCG) method. Template
matching is slightly more precise for small targets but needs more resources [1,2]. The







G. Wiora, P. Babrou, and R. Manner 



WCG co-ordinates $u_{wcg}$ and $v_{wcg}$ of a target are given by

$$u_{wcg} = \frac{\sum_{u,v} b_{u,v}\, g_{u,v}\, u}{\sum_{u,v} b_{u,v}\, g_{u,v}}, \qquad
  v_{wcg} = \frac{\sum_{u,v} b_{u,v}\, g_{u,v}\, v}{\sum_{u,v} b_{u,v}\, g_{u,v}}, \tag{1}$$

where $g_{u,v}$ is the grey value of the pixel with the coordinates $u$ and $v$. The quality of the
WCG co-ordinates depends very much on the proper selection of the pixels that are
included in the calculation. This is controlled by the binary function $b_{u,v}$, which determines
the pixel group that belongs to a certain target.
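Equation (1) can be illustrated by a direct, unoptimised Python sketch. The function name and the small test image are assumptions for illustration only; the FPGA implementation described later computes the same sums in a pipelined fashion.

```python
def weighted_centre_of_gravity(grey, mask):
    """grey[v][u]: grey values g; mask[v][u]: binary segmentation function b (0 or 1)."""
    m0 = mu = mv = 0
    for v, (grey_row, mask_row) in enumerate(zip(grey, mask)):
        for u, (g, b) in enumerate(zip(grey_row, mask_row)):
            if b:
                m0 += g          # sum of b * g
                mu += g * u      # sum of b * g * u
                mv += g * v      # sum of b * g * v
    if m0 == 0:
        raise ValueError("no pixel selected, or all selected grey values are zero")
    return mu / m0, mv / m0


# Hypothetical 4 x 4 image with a 2 x 2 target that is brighter on its right side.
grey = [[0, 0, 0, 0],
        [0, 10, 30, 0],
        [0, 10, 30, 0],
        [0, 0, 0, 0]]
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
u_wcg, v_wcg = weighted_centre_of_gravity(grey, mask)
```

The brighter right column pulls the horizontal coordinate towards u = 2, giving (1.75, 1.5) instead of the geometric centre (1.5, 1.5).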

2.1 Segmentation 

The computation of $b_{u,v}$ is called segmentation in image processing and is done
here in three steps:

1. searching the target contour

2. filling the target contour

3. connecting the line segments of a target

The detection of a contour can be grey level based or edge based. A good example
of a grey level based method is given in [3]. The edge based approach is more robust
against shading variations since it does not rely on a global grey level threshold. When
searching for the contours, not only the presence or absence of an edge at a given position
is of interest but also the edge direction.



Edge Detection: A fast and noise-insensitive method of finding the local edge direction
is the Sobel operator $S = |S_u| + |S_v|$. Each of the partial operators is the combination of
a directional smoothing kernel and a derivative that is rotated by 90° with respect to the
smoothing direction. The local edge direction $\theta_{u,v}$ is given by the application of the two
parts to the grey image $g_{u,v}$ at the center position $(u, v)$:

$$\theta_{u,v} = \tan^{-1} \frac{S_v * g_{u,v}}{S_u * g_{u,v}} \tag{2}$$

In [4] it is shown that the error of the local direction can be reduced by a factor of 7 just
by using different coefficients in the Sobel kernel. We use this optimized Sobel kernel:

$$S_u = \begin{pmatrix} 3 & 0 & -3 \\ 10 & 0 & -10 \\ 3 & 0 & -3 \end{pmatrix}, \qquad
  S_v = \begin{pmatrix} 3 & 10 & 3 \\ 0 & 0 & 0 \\ -3 & -10 & -3 \end{pmatrix} \tag{3}$$

The normalization factor of 1/32 is left out since the absolute value of the Sobel operator
is not relevant for edge detection.

The criterion $C$ for the existence of an edge is the sum of the absolute values of the
two Sobel parts, i.e. the edge strength:

$$C_{u,v} = |S_u * g_{u,v}| + |S_v * g_{u,v}| \tag{4}$$

The edge image $c_{u,v}$ is obtained from $C_{u,v}$ by binarizing it with a given threshold. This
binary edge image is robust to noise in the grey image due to the smoothing part in the
Sobel operator. To allow an inside-outside distinction while scanning the image line by
line, the local edge direction is used. For horizontal scanning it is sufficient to watch
the sign $s$ of the Sobel operator's horizontal part:

$$s = \operatorname{sign}(S_u * g_{u,v}) \tag{5}$$

For bright targets on a dark background, $s$ is negative at the left side of the target border
and positive at the right side.
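The edge criterion and the sign of the horizontal Sobel part can be illustrated with a short Python sketch. It assumes that the kernels are applied as a correlation (no kernel flip), which reproduces the sign convention stated above; the threshold value and the test image are hypothetical.

```python
# Optimised Sobel kernels from the text (normalisation 1/32 omitted).
S_U = [[3, 0, -3], [10, 0, -10], [3, 0, -3]]
S_V = [[3, 10, 3], [0, 0, 0], [-3, -10, -3]]


def correlate3x3(grey, kernel, u, v):
    """Sum of kernel[j][i] * grey[v-1+j][u-1+i] over the 3 x 3 neighbourhood of (u, v)."""
    return sum(kernel[j][i] * grey[v - 1 + j][u - 1 + i]
               for j in range(3) for i in range(3))


def edge_at(grey, u, v, threshold):
    horizontal = correlate3x3(grey, S_U, u, v)
    vertical = correlate3x3(grey, S_V, u, v)
    strength = abs(horizontal) + abs(vertical)      # edge strength C
    is_edge = strength > threshold                  # binarised edge image c
    sign = (horizontal > 0) - (horizontal < 0)      # s, sign of the horizontal part
    return is_edge, sign, strength


# Hypothetical image: bright vertical bar (columns 2 and 3) on a dark background.
grey = [[0, 0, 100, 100, 0, 0] for _ in range(3)]
left = edge_at(grey, 2, 1, threshold=500)
right = edge_at(grey, 3, 1, threshold=500)
```

For this image the left border pixel yields a negative and the right border pixel a positive sign, in line with the convention for bright targets on a dark background.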

Region Filling: The second step of the segmentation is filling. On the basis of the edge
criterion $c$ and the edge direction $s$ the contours can be filled. The result of the filling
operation is the binary function $b_{u,v}$, which defines closed regions. Up to this step the
whole processing can be done line by line and pixel by pixel and is thus easy to parallelize
and not memory intensive.

Computation of the WCG moments: The next step of the algorithm is the intra-line
computation of the WCG moments. This happens separately for each line segment of
each target. The following moments have to be computed and stored:

$$\sum_u b_{u,v}\, g_{u,v}, \qquad \sum_u b_{u,v}\, g_{u,v}\, u, \qquad
  \sum_u b_{u,v}\, g_{u,v}\, v, \qquad \sum_u b_{u,v} \tag{6}$$

The horizontal extents $u_{min}$ and $u_{max}$ of the target line and the vertical position $v$
are saved as well. The moments are summed up when adjacent line segments are merged.

Connecting Line Segments: The next step of the algorithm is the connectivity analysis.
For each line segment the moments and the horizontal extents are known. To find
connected line segments it is sufficient to compare the horizontal extents $u_{min}$ and $u_{max}$
of the current target's line segment with those of the line segments of the previous line,
which are stored in a FIFO. The result of the comparison determines how the moments are
handled. There are three possibilities:

1. The current segment belongs to a new target.

2. The current segment belongs to the oldest target in the list.

3. The oldest target in the list is done.

In case 1 the current segment is stored in the FIFO. In case 2 the oldest target in the list
and the current segment are combined and stored in the FIFO. In case 3 the oldest target
in the list is moved to the output buffer.
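The three cases can be sketched in software as follows. This Python sketch is a simplification under stated assumptions: targets are convex, so that each target has at most one segment per line, and only the zeroth moment is accumulated (the other moments would be summed in the same way). The function name and the data layout are hypothetical.

```python
from collections import deque


def connect_lines(lines):
    """lines: per image line, a left-to-right list of segments (u_min, u_max, m0).
    Returns the accumulated moment m0 of every finished target."""
    finished = []
    previous = deque()                       # FIFO holding the open targets of line v-1
    for segments in lines + [[]]:            # the extra empty line flushes the FIFO
        current = deque()
        for u_min, u_max, m0 in segments:
            # Case 3: the oldest target lies completely left of the segment, so it is done.
            while previous and previous[0][1] < u_min:
                finished.append(previous.popleft()[2])
            if previous and previous[0][0] <= u_max:
                # Case 2: the extents overlap, fuse the segment with the oldest target.
                current.append((u_min, u_max, previous.popleft()[2] + m0))
            else:
                # Case 1: the segment starts a new target.
                current.append((u_min, u_max, m0))
        finished.extend(target[2] for target in previous)   # targets without continuation
        previous = current
    return finished
```

For example, `connect_lines([[(2, 4, 10)], [(1, 5, 20), (8, 9, 7)], [(2, 4, 10), (8, 9, 7)]])` fuses the overlapping segments into two targets with the summed moments 40 and 14.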

2.2 Classification 

The target area, which is given by $\sum b_{u,v}$, and the minimum $(u_{min}, v_{min})$ and maximum
extents $(u_{max}, v_{max})$ of the targets are used for a pre-classification of the targets to
reduce the amount of noise in the data output. Targets that are too small, too large, or
have a large aspect ratio are filtered out.
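Such a pre-classification amounts to a few comparisons per target, as the following Python sketch shows. The limits `min_area`, `max_area` and `max_aspect` are hypothetical placeholders; the values actually used are not stated in the text.

```python
def keep_target(area, u_min, u_max, v_min, v_max,
                min_area=9, max_area=10000, max_aspect=2.0):
    """Return True if a target passes the size and aspect-ratio filter (limits are assumed)."""
    width = u_max - u_min + 1
    height = v_max - v_min + 1
    aspect = max(width, height) / min(width, height)
    return min_area <= area <= max_area and aspect <= max_aspect
```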
3 Implementation

The block diagram for the implementation of the algorithm described above is shown in fig. 2.
The goal was to realize a real-time target measurement with minimum latency and maximum
throughput. The camera delivers the pixels with a clock frequency of 66 MHz. Ten pixels of
8 bits each are transferred in one clock cycle, which means that the input vector for the
design is 80 bits wide.
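The quoted figures can be cross-checked with a few lines of arithmetic, written here as Python. All numbers are taken from the text; only the variable names are introduced for illustration.

```python
PIXEL_CLOCK_HZ = 66_000_000     # design clock
PIXELS_PER_CLOCK = 10           # pixels delivered per clock cycle
BITS_PER_PIXEL = 8

input_width_bits = PIXELS_PER_CLOCK * BITS_PER_PIXEL     # width of the input vector
peak_pixel_rate = PIXEL_CLOCK_HZ * PIXELS_PER_CLOCK      # pixels per second at the interface
sensor_pixel_rate = 1280 * 1024 * 485                    # full frames at 485 Hz
```

The interface rate of 660 million pixels per second is thus slightly above the roughly 636 million pixels per second produced by full 1280 x 1024 frames at 485 Hz.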

3.1 Segmentation 

Since the kernel of the Sobel filter is 3 x 3 pixels wide, three lines of image data have
to be stored in the input register, which consists of FIFOs. In practice we need two
lines plus 20 pixels because of pipeline delays. The FIFO structure allows simultaneous
access to 33 grey values from a 12 x 3 pixel wide area. According to (3) the grey values
are multiplied by 3 and by 10, and the sums for each of the 10 center pixels are calculated.

The filling process is implemented with logical functions for 10 pixels in parallel. The
input information used for this is the binarized edge strength $c_{u,v}$ and the horizontal
edge direction $s_{u,v}$ from (5). The fill status of a pixel $b_{u+1,v}$ depends on the following
Boolean expression:

$$b_{u+1,v} = c_{u+1,v} \lor \left( b_{u,v} \land b_{u+1,v-1} \land
  \lnot \left( c_{u,v} \land (s_{u,v} > 0) \right) \right) \tag{7}$$

In words, a pixel $b_{u+1,v}$ is filled if the pixel is an edge, or if the pixel to its left
and the pixel above it are filled and the pixel to its left is not a right edge.
This expression is computed for 10 pixels simultaneously.
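The fill rule stated in words above can be sketched sequentially in Python; the hardware evaluates it for ten pixels at once, whereas this illustration simply walks along one image line. The function name, the treatment of the first pixel of a line and the example line are assumptions.

```python
def fill_line(c, s, above):
    """c: binarised edge flags, s: signs of the horizontal Sobel part,
    above: fill status b of the previous line v-1. Returns the fill status of line v."""
    b = [False] * len(c)
    b[0] = bool(c[0])                                 # assumed start condition
    for u in range(len(c) - 1):
        left_is_right_edge = bool(c[u]) and s[u] > 0
        b[u + 1] = bool(c[u + 1]) or (b[u] and bool(above[u + 1]) and not left_is_right_edge)
    return b


# Hypothetical line: left edge at u = 2 (s < 0), right edge at u = 5 (s > 0).
c = [0, 0, 1, 0, 0, 1, 0, 0]
s = [0, 0, -1, 0, 0, 1, 0, 0]
above = [False, False, True, True, True, True, False, False]
filled = fill_line(c, s, above)
```

The region between the two edges is filled, and filling stops behind the right edge even if the line above happens to be filled there.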




Fig. 2. Scheme of the implementation. (Output labels in the diagram: Target ready; Data output,
125 bit per target.)

3.2 Intra Line Computation of the WCG

The minimum size of a target is one pixel. Since the Sobel operator finds an edge to
the left and to the right of this single pixel, the minimum region size of a target is
three by three pixels. The minimum distance between two targets is two pixels. If there is
only one pixel distance between two targets they are handled as one target. This means
that the maximum number of targets in a ten pixel block is three. Therefore the intra-line
computation needs three identical ALUs to compute the moments for three targets
simultaneously. The inputs of the ALUs are the grey values $g_u$, the horizontal and vertical
moments $u \cdot g_u$ and $v \cdot g_u$, the coordinates $u$ and $v$ and the binary segmentation function
$b_u$. Since $v$ is constant in a line it is left out in the indices.

The vector $b_u$ is segmented into a maximum of three connected areas by the ALU
control unit. Each area is directed to one of the three ALUs.
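Splitting a ten pixel block of $b_u$ into its connected areas corresponds to a simple run detection, sketched here in Python. The function name and the example block are hypothetical, and the sketch does not model how the control unit distributes the areas to the ALUs.

```python
def split_runs(b_block):
    """Return (start, end) index pairs, end inclusive, of the connected runs of ones in a block."""
    runs, start = [], None
    for u, b in enumerate(b_block):
        if b and start is None:
            start = u                        # a new connected area begins
        elif not b and start is not None:
            runs.append((start, u - 1))      # the current area ends before u
            start = None
    if start is not None:
        runs.append((start, len(b_block) - 1))
    return runs


# Hypothetical block with three connected areas, one per ALU.
runs = split_runs([1, 0, 0, 1, 1, 1, 0, 0, 1, 1])
```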

Since the processing of one input block with 10 pixels has to be done in parallel,
all components of the ALU can handle all input data in parallel. One ALU includes the
following components:

- 10-input summation unit for $\sum b_u \cdot g_u$

- 10-input summation unit for $\sum b_u \cdot g_u \cdot u$

- 10-input summation unit for $\sum b_u \cdot g_u \cdot v$

- summation unit for $\sum b_u$

- recording of the $u_{min}$, $u_{max}$ and $v_{min}$, $v_{max}$ co-ordinates

The output vector of each ALU is stored in a FIFO. Up to this point the whole design is
based on the pixel clock and does the processing synchronously. The delay is less than
4 µs and results from the input buffer for two image lines.
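The stated delay can be checked with a short calculation. It assumes that the buffer holds two image lines of 1280 pixels plus the 20 pixels mentioned in section 3.1, and that ten pixels are processed per 66 MHz clock cycle:

```latex
t_{\mathrm{delay}} \approx \frac{2 \cdot 1280 + 20}{10 \cdot 66 \cdot 10^{6}\,\mathrm{s}^{-1}}
                   = \frac{2580}{660 \cdot 10^{6}}\,\mathrm{s}
                   \approx 3.9\,\mu\mathrm{s}
```

This is consistent with the delay of less than 4 µs given above.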

After this point the processing continues in a fully serial pipeline with list based data 
and does not rely on the pixel clock any more. This point is well suited to be used for 
the clock domain change from the input pixel clock to the system clock of the FPGA. 
That is done by three FIFOs and a parallel to serial converter (MUX). 

3.3 Connectivity Analysis 

The connectivity analysis unit compares the $u_{min}$ and $u_{max}$ co-ordinates of the moments
of the current line $v$ from the MUX with those of the oldest moments in the dual-ported RAM
from line $v-1$. The results of the comparison are the control signals for the column ALU
and the read and write pointers for the RAM. If the co-ordinates of the moments from MUX
and RAM show overlapping regions, the two segments belong to the same target and the
moments can be summed. The resulting moments are written back to the RAM.

If the current target is to the left of the oldest target in the RAM ($u_{max,v} < u_{min,v-1}$),
the current target is new and is written to the RAM. If the current target is to the right of
the oldest target in the RAM ($u_{min,v} > u_{max,v-1}$), the oldest target is removed from the
RAM and written to the data output.

4 Experimental Results 

To demonstrate the maximum data rate we acquired a sequence of a few targets on
a rotating fan and about 20 static targets in the background. Figure 3(a)
shows the trajectory of two targets for one revolution. The fan was rotating at about
2000 RPM. The system acquired the image coordinates at the full data rate of 485
frames per second. The time step between two positions is 2.0625 ms.

Fig. 3. a) The image space trajectory of two targets on a fan recorded with 485 frames/s. b) The
steering angle $\lambda$ in degrees of a slalom maneuver over 21 seconds.

The target measurement stability was tested in the application setup that is described
below. A sequence of 10000 images with stable camera and object position was acquired
and evaluated. The standard deviations $\sigma_u$ and $\sigma_v$ of the image point coordinates were
below 0.009 pixels. The maximum deviations $\Delta u$ and $\Delta v$ from the average coordinate
were below 0.05 pixels.

A good measure for the stability and consistency of the whole system is the
photogrammetric bundle adjustment that is used to calibrate the camera. A calibration with
48 images and 172 object points had a standard deviation of image point co-ordinates
below 0.033 pixels.



5 Application 

The image processing module was developed for the WheelWatch photogrammetry system,
which can measure the wheel position of a moving car in real time [5,6]. The camera
head with the integrated FPGA module is shown in fig. 1(b).

The car body and the wheel are marked with retro-reflective targets as shown
in fig. 1(a). The 3-d co-ordinates of the targets are measured with static photogrammetry
before the test drive begins and are thus well known. The measurement camera is
calibrated with a standard photogrammetric procedure.

In the measurement mode the relative orientation of the wheel to the car body can
be computed with a variant of the photogrammetric resection. The results are the six
degrees of freedom of the wheel movement relative to the car body: the position $P$ of
the axis center and the orientation angles $\omega$, $\varphi$ and $\kappa$ of the axis. From these values the
steering angle $\lambda$ and the tilt angle $\theta$ can be derived.

In fig. 3(b) the measured steering angle $\lambda$ of a 21 second test drive
is shown. The results of the first test drives show that the measurement of the wheel
position with a precision better than 0.1 mm and of the wheel angles with better than
0.01° is possible.

6 Conclusion 

6.1 Resume 

It has been shown that real-time photogrammetry with very high data rates is possible.
This enables the use of photogrammetry not only in short-term analyses like crash tests,
but also for long-term investigations. A further use of the system is in control loops for
positioning systems or robots, which is only possible due to the short latency of only 4
milliseconds.

6.2 Outlook 

With the proposed system the image processing is no longer the limiting factor of
photogrammetric data handling. The current implementation runs with a relatively low
clock frequency of 66 MHz. If needed, this can easily be increased by a factor of two or
more. Since 40% of the resources in the FPGA are still free, a further parallelization
is also possible. The bottleneck is currently the CameraLink interface with a limit of
660 megapixels per second. To increase this any further the FPGA needs a more direct
connection to the camera chip, which would reduce the modularity of the system.


Abstract. A new methodology to determine and correct lens distortion
typically occurring in low cost camera systems is presented. The method
is easy to apply and leads to very accurate results with moderate effort.
The method combines the radial lens distortion as the main part of the
global distortion with the remaining weaker parts. Distortion models are
presented and their validity is shown. The algorithm needs some reference
points with known coordinates. Examples and accuracy results are
presented and discussed.



1 Introduction 

Photographs or digital images are often used in industrial applications to perform
quantitative measurements. These tasks should usually be solved with camera
systems without lens distortion. However, this is not always possible, or the costs for
distortion-free systems are too high. When cameras that suffer from lens distortion are
used, the measurements become erroneous. Here, an exact distortion correction may help to
overcome the problem.

A number of methods have been published which obtain the parameters of the radial
distortion function and correct the images [1-5,8-14]. Usually, the determination of the
distortion function is performed in the context of camera calibration. This requires
considerable effort, which one would sometimes like to avoid.

Conventionally, lens distortion is described by a distortion function including radial,
decentering, and affine parameters. Radial distortion usually accounts for the main part
of the distortion. Therefore, the majority of works consider no distortion other than the
radial one [1,2,4,5,8,9,11,12]. This may be sufficient if the required
measuring accuracy is not very high or if the actual distortion is described sufficiently
exactly by the radial distortion. Some authors take decentering distortion into account
[10,14]. This may improve the correction. However, some other kind of distortion
may still be present.

Kruck [6] suggests an approach including some 30 parameters describing the
lens distortion. However, the use of so many parameters brings some disadvantages.
First, a powerful calculation system for processing the data is necessary. Second, some
of the parameters are not independent of each other. Third, noise may influence
the measurement and the reproducibility may not be sufficient.

The aim of this work was to find a simple and robust methodology for
the exact correction of lens distortion effects.



2 Distortion Models and Approach 

In order to solve the problem it should be taken into account that the main contribution
to the distortion results from the radial lens distortion. The next largest is the
decentering distortion, and finally, a number of other small effects can occur which are
often difficult to describe as a function of the image coordinates.

Therefore, the deviation of the actual image from the ideal one, obtained by applying
the pinhole camera model, is called the global distortion. The global
distortion is the sum of the radial distortion, the decentering distortion, and the
remaining distortion. We assume that the radial distortion makes the largest contribution to
the global distortion.

2.1 Approach 

The approach for the determination of the global distortion is the following. First, it is
assumed that only radial distortion occurs. The radial distortion is calculated and the
result, i.e. the corrected image, is compared with the ideal image obtained by estimating
the ideal pinhole mapping. This estimate is obtained by fitting a projective
2D-2D transform that maps the known original point coordinates (these coordinates are
actually 3D but lie in a common plane, thus the z-coordinate may be set to zero)
to the corrected image coordinates. The remaining error can include distortion errors,
too, but these are compensated by the fitted projective transform.

The difference between the corrected and the ideal coordinates is the input for 
further distortion determination which will be described later. 

2.2 Radial Distortion Model 

The description of radial lens distortion is commonly known from the literature [1-14].
The following (and even more) models can be applied:

$$r' = r\,(1 + a_2 r^2 + a_4 r^4 + \dots), \qquad
  r' = \frac{r}{1 + b_2 r^2 + b_4 r^4 + \dots},$$
$$r = r'\,(1 + c_2 r'^2 + c_4 r'^4 + \dots), \qquad
  r = \frac{r'}{1 + d_2 r'^2 + d_4 r'^4 + \dots} \tag{1}$$

Here, $r$ is the undistorted and $r'$ the distorted distance of an image point $p = (x, y)$
or $p' = (x', y')$, respectively, from the distortion centre $S = (X_S, Y_S)$. The $a_i$, $b_i$, $c_i$ and $d_i$
are the respective distortion coefficients.

Usually, the coefficients $x_6$ and higher (where $x$ stands for $a$, $b$, $c$ or $d$) are not
necessary and are therefore neglected. However, it is important to decide whether $x_4$ is
used or not. This mainly depends on the quality of the lens. The decision about using one
or two coefficients should be made after a first analysis (see section 4). Assume that this
decision has been made. One of the four only slightly differing models must still be selected.
Which is the best one? Unfortunately, this depends on the actual distortion which is to
be determined.

Thus, there are two possible ways to select the right model. In the first, all models are
applied, a quality measure evaluates the results, and the model leading to the best
result is selected. We, however, suggest the following procedure: the model which has
the best performance (best numerical behaviour and implementation) is applied. In our
case, this is the model with the coefficients $d_2$ and $d_4$ (see [1,2]). Finally, it can be
tested whether a model conversion improves the result; otherwise the remaining errors due to
the model deviation are processed in the final step of distortion determination (see
section 2.4).
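Applying the model with the coefficients $d_2$ and $d_4$ to a single point can be sketched as follows in Python. The sketch assumes the form $r = r' / (1 + d_2 r'^2 + d_4 r'^4)$ and a known distortion centre; the function name and the numerical values are hypothetical, and the estimation of the parameters themselves is not covered.

```python
def undistort_point(xd, yd, centre, d2, d4=0.0):
    """Map a distorted point (xd, yd) to its undistorted position for a given distortion centre."""
    xs, ys = centre
    dx, dy = xd - xs, yd - ys
    r2 = dx * dx + dy * dy                           # squared distorted radius r'^2
    scale = 1.0 / (1.0 + d2 * r2 + d4 * r2 * r2)     # r / r'
    # The point moves along the ray through the distortion centre.
    return xs + scale * dx, ys + scale * dy


# Hypothetical values: centre (100, 100), point 10 pixels to the right, d2 = 0.001.
x, y = undistort_point(110.0, 100.0, (100.0, 100.0), d2=0.001)
```

With these values the radius shrinks from 10 to 10/1.1, i.e. to about 9.09 pixels, while the direction from the centre is preserved.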

2.3 Decentering Distortion 

Decentering distortion is a result of the decentering of the lenses (see [7]) and can be
described by the following approach [3]:

$$\Delta x_{dec} = b_1 \cdot (r'^2 + 2x'^2) + 2 b_2 \cdot x' \cdot y'$$
$$\Delta y_{dec} = b_2 \cdot (r'^2 + 2y'^2) + 2 b_1 \cdot x' \cdot y' \tag{2}$$

where $\Delta x_{dec}$ and $\Delta y_{dec}$ are the pixel errors resulting from decentering distortion as a
function of the distorted coordinates. Equivalently, the model

$$\Delta x_{dec} = a_1 \cdot (r^2 + 2x^2) + 2 a_2 \cdot x \cdot y$$
$$\Delta y_{dec} = a_2 \cdot (r^2 + 2y^2) + 2 a_1 \cdot x \cdot y \tag{3}$$

as a function of the undistorted coordinates can be considered. The decentering
distortion can be determined within the camera calibration procedure or
separately by iterative methods [10,14]. One of the iterative methods is briefly
outlined in section 4.2.
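Evaluating the decentering model for one point takes only a few lines of Python. The sketch assumes that the coordinates are given relative to the distortion centre; the function name and the parameter names `p1` and `p2` (standing for the two decentering coefficients) are introduced here for illustration.

```python
def decentering_shift(x, y, p1, p2):
    """Pixel shift (dx, dy) caused by decentering distortion at the point (x, y)."""
    r2 = x * x + y * y
    dx = p1 * (r2 + 2 * x * x) + 2 * p2 * x * y
    dy = p2 * (r2 + 2 * y * y) + 2 * p1 * x * y
    return dx, dy


# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
shift = decentering_shift(1.0, 2.0, p1=0.5, p2=0.25)
```

Unlike radial distortion, the shift is in general not directed along the ray through the distortion centre.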

2.4 Remaining Distortion 

After removal of the radial and decentering distortion some camera systems are already
very close to the ideal pinhole model. However, low cost cameras in particular
suffer from some remaining distortion which cannot be described by a simple function.
Our approach for the description of the remaining distortion is outlined briefly
next.

The distortion is characterised by a number of vectors placed at certain points in
the image. The number of these vectors depends on the local change of the remaining
distortion and is set by the user. In the extreme case, every pixel in the image has its
own distortion vector. For an illustration see fig. 3, where the distortion vectors are
drawn 100 times longer than the actual distortion.

2.5 Quality Criterion 

In order to evaluate the quality of the distortion determination and correction, a quality
measure is defined. Assuming that an ideal image is available (obtained by
some good estimation), the quality of the correction is expressed by the averaged
residual error ARE (also known as RMS) and the maximum residual error (MRE)
between the corrected image and the ideal one:

$$\mathrm{ARE} = \frac{1}{n} \sum_{i=1}^{n} \sqrt{(x_i - x_i^p)^2 + (y_i - y_i^p)^2}$$
$$\mathrm{MRE} = \max_i \sqrt{(x_i - x_i^p)^2 + (y_i - y_i^p)^2} \tag{4}$$

Here $(x_i, y_i)$ denote the corrected and $(x_i^p, y_i^p)$ the ideal coordinates of point $i$.
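Both quality measures can be computed from two lists of corresponding points, as in the following Python sketch. It assumes that ARE is the mean of the Euclidean point distances and MRE their maximum; the function name and the two example points are hypothetical.

```python
import math


def residual_errors(corrected, ideal):
    """corrected, ideal: equally long, non-empty lists of (x, y) points. Returns (ARE, MRE)."""
    distances = [math.hypot(xc - xi, yc - yi)
                 for (xc, yc), (xi, yi) in zip(corrected, ideal)]
    return sum(distances) / len(distances), max(distances)


# Hypothetical example: one exact point and one point that is 5 pixels off.
are, mre = residual_errors([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```

For this example the averaged residual error is 2.5 pixels and the maximum residual error is 5 pixels.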



3 Calibration Patterns 

In order to determine and correct the lens distortion of several camera systems, two 
calibration patterns were used. The first one was a grid pattern from a tripod table (see 
fig. 1). Here the intersection points of the horizontal and vertical lines were used. The 
second one was a dot pattern (see fig. 2) where the centre points of the dots were used. 




Fig. 1. Grid pattern Fig. 2. Point pattern 
4 Algorithms

4.1 Determination of the Radial Distortion Function 

The algorithm used to obtain the radial distortion parameters $X_s$, $Y_s$, $d_2$, and $d_4$
is described in detail elsewhere [1,2] and is only briefly outlined here.

All points which are collinear in reality should lie on a straight line in the
undistorted image. These points are used to construct point triples. For known
coordinates of the distortion centre, a linear optimisation task is then formulated
whose solution yields the distortion coefficients $d_2$ and $d_4$. The unknown distortion
centre and the distortion coefficients are determined within an iterative process
starting with the image centre as the distortion centre.

4.2 Determination of the Decentering Distortion Function 

Assume that decentering distortion is present. Simply fitting a projective transform
of the real point coordinates to the image coordinates obtained by radial distortion
correction is not the right way to determine the decentering distortion, because
convergence is not guaranteed. A better algorithm can be constructed using the
properties of the decentering distortion: a straight line is transformed into a curve
which does not intersect this straight line. Hence, approximating the ideal image by
fitting tangents to the distorted straight lines is better than fitting lines with
minimal Euclidean distance to the points (see fig. 3).

The iterative algorithm to obtain decentering distortion is the following: 

Input:  - uncorrected distorted point coordinates assigned to straight lines
        - radial distortion parameters

Algorithm

1. Point correction with radial distortion parameters
2. Fitting of straight lines as tangents
3. Determination of decentering distortion coefficients
4. Quality measure analysis
   improvement -> new determination of radial distortion parameters, go to 1
   no improvement -> end of the algorithm

Output: - decentering distortion coefficients and new radial distortion parameters, or
        - information that no significant decentering distortion is present
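
Step 3 of the algorithm can be illustrated by a linear least-squares fit, since the decentering model is linear in its two coefficients. The following is only a sketch under assumptions: the displacement of each point from its fitted tangent line is taken as already known, coordinates are relative to the distortion centre, and the function name and the normal-equation solution are illustrative rather than the procedure of the source.

```python
def fit_decentering(points, displacements):
    """Least-squares estimate of the coefficients (c1, c2) in the model

        dx = c1 * (r^2 + 2x^2) + 2 * c2 * x * y
        dy = c2 * (r^2 + 2y^2) + 2 * c1 * x * y

    points:        list of (x, y) relative to the distortion centre (assumed)
    displacements: list of observed (dx, dy), one per point (assumed given)
    """
    s11 = s12 = s22 = t1 = t2 = 0.0
    for (x, y), (dx, dy) in zip(points, displacements):
        r2 = x * x + y * y
        p = r2 + 2.0 * x * x    # factor of c1 in the dx equation
        q = 2.0 * x * y         # factor of c2 in dx and of c1 in dy
        w = r2 + 2.0 * y * y    # factor of c2 in the dy equation
        # accumulate the 2x2 normal equations (A^T A) c = A^T b
        s11 += p * p + q * q
        s12 += p * q + q * w
        s22 += q * q + w * w
        t1 += p * dx + q * dy
        t2 += q * dx + w * dy
    det = s11 * s22 - s12 * s12
    if det == 0.0:
        raise ValueError("degenerate point configuration")
    c1 = (t1 * s22 - t2 * s12) / det
    c2 = (s11 * t2 - s12 * t1) / det
    return c1, c2
```

On noise-free synthetic displacements generated from the same model, the fit returns the generating coefficients up to rounding error. With real data, the fitted values would feed the quality measure analysis in step 4.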




Fig. 3. Left: straight line with distorted curve (simulated and exaggerated), middle: distorted
curve with fitted line, right: distorted curve with tangent; the tangent is closer to the
undistorted line and thus the chance to achieve convergence is higher
4.3 Analysis of Residuals

After correction of the radial and decentering distortion, the remaining errors of the
corrected points relative to their expected ideal coordinates can be considered.
Figure 4 shows an example of such residual errors.

The residual errors in fig. 4 seem to be systematic, but other error sources must
be considered, too. First, the original points may deviate from the expected
coordinates. Second, there may be errors in the determination of the distorted point
coordinates in the original image. Third, errors may result from noise.

In order to suppress these influences, a meaningful averaging of the residual error
vectors should be applied. The first influence can be excluded by averaging views
with rotated optical axis (covering the whole range) and views from different
calibration patterns. The second one can also be reduced by averaging images with a
slightly rotated optical axis (a few degrees are sufficient). The errors resulting from
noise are already reduced by the first two averaging procedures. Assuming that the
distortion changes slowly within a local neighbourhood, a number of pixels can be
grouped into cells with a common distortion value. See fig. 5 for an example of
averaged residuals.
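
The grouping of residual vectors into cells can be sketched as follows. This is a minimal illustration, assuming square cells of a user-chosen size, residuals defined as corrected minus ideal coordinates, and a correction that subtracts the mean residual of the cell; the function names and data layout are hypothetical.

```python
from collections import defaultdict


def average_residuals_per_cell(points, residuals, cell_size):
    """Average residual vectors over square cells of side cell_size (pixels).

    points:    list of (x, y) image coordinates
    residuals: list of (rx, ry), assumed to be corrected minus ideal position
    Returns a dict mapping cell index (col, row) to the mean residual vector.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), (rx, ry) in zip(points, residuals):
        key = (int(x // cell_size), int(y // cell_size))
        acc = sums[key]
        acc[0] += rx
        acc[1] += ry
        acc[2] += 1
    return {key: (sx / n, sy / n) for key, (sx, sy, n) in sums.items()}


def correct_point(point, cell_means, cell_size):
    """Subtract the mean residual of the point's cell; cells without data
    leave the point unchanged."""
    x, y = point
    key = (int(x // cell_size), int(y // cell_size))
    rx, ry = cell_means.get(key, (0.0, 0.0))
    return x - rx, y - ry
```

Residuals from several views and patterns would simply be concatenated before calling `average_residuals_per_cell`, so that the averaging over views and the averaging over cells happen in one pass.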







Fig. 4. Residuals from single image Fig. 5. Averaged residuals 



5 Experiments and Results 

A number of images of the described calibration patterns were recorded using different
digital cameras (Kodak DC210, Casio QV3500, Olympus C2500). To avoid the effect
of the distortion changing with the distance between pattern and camera, the
images were taken from a constant distance. The radial and decentering distortion
was determined for each single image, and the resulting distortion coefficients were
averaged. Note that the radial coefficients $d_2$ and $d_4$ and the decentering
coefficients $b_1$ and $b_2$ were calculated by least squares techniques [2].

The corrected images were used to perform the residual analysis as described. Every
corrected image was then corrected again such that every point coordinate was
shifted by the mean residual vector of the corresponding cell. The quality
criterion was applied again, and the resulting ARE and MRE values are used to
characterise the correction result.

All results are summarised in table 1. The distortion parameters and the ARE and
MRE values are given for the uncorrected images, for the images with radial and
decentering correction, and for the finally corrected images. Figure 6 shows an example
of a distorted image and fig. 7 shows the corresponding corrected one. In order to
illustrate the distortion effect, a stretched view is added in both cases.



Table 1. Results of distortion determination and correction

Camera                            Kodak            Casio            Olympus
Image size                        1152 x 864       1024 x 768       856 x 684
Symmetry point X / Y              578 / 381        496 / 387        427 / 367
Radial coefficient d_2            -1.62 * 10^-7    -1.87 * 10^-7    -3.37 * 10^-7
Radial coefficient d_4            +2.36 * 10^-13   +1.81 * 10^-13   +2.75 * 10^-13
Decentering coefficient b_1       2.31 * 10^-7     n.s.             n.s.
Decentering coefficient b_2       n.s.             n.s.             n.s.
ARE / MRE without correction      1.49 / 2.71      1.86 / 4.94      2.63 / 5.00
ARE / MRE after rad.+dec. corr.   0.18 / 0.45      0.17 / 0.62      0.13 / 0.53
ARE / MRE with residual corr.     0.08 / 0.22      0.09 / 0.34      0.06 / 0.18

ARE and MRE values are given in pixels.




Fig. 6. Distorted image Fig. 7. Corrected view 



6 Summary, Discussion, and Outlook 

A simple new methodology to determine and correct lens distortion typically occurring
in low-cost camera systems was presented. This method considerably improves
measurements obtained by such cameras and makes these systems applicable for
precise measurements.

The averaged residual error of less than 1/10 pixel shows that almost no other
error is present after the final correction. It should be taken into account
that in the calibration patterns the dot centres or line intersection points can
deviate from the ideal ones. In our example, this error has a standard deviation of
±0.008 mm, which corresponds to about ±0.04 pixels. Another error source is the
determination of the dot centres or line intersection points by means of image
processing. Here, an uncertainty (standard deviation) of ±0.02 up to ±0.05 pixels was
determined experimentally. The use of a calibration pattern improves the accuracy as
compared with methods using only line segments of urban scenes (see [1,5]).
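
As a rough plausibility check, the two uncertainties quoted above can be combined. This is only a sketch under the assumption, not stated in the source, that the pattern error and the detection error are independent and add in quadrature; the symbols $\sigma_{pattern}$, $\sigma_{detect}$ and $\sigma_{total}$ are introduced here for illustration.

```latex
\[
\sigma_{total} = \sqrt{\sigma_{pattern}^2 + \sigma_{detect}^2},
\qquad
\sqrt{0.04^2 + 0.02^2} \approx 0.045,
\qquad
\sqrt{0.04^2 + 0.05^2} \approx 0.064
\]
```

Under this assumption the combined uncertainty of roughly 0.045 to 0.064 pixels is of the same order as the ARE values of 0.06 to 0.09 pixels reported after residual correction in table 1.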

A remaining error which cannot be determined by this method is reduced to a
similarity transform of the image and only leads to a change of the extrinsic and
intrinsic camera parameters. Thus, this error can be neglected.

Future work should be concerned with a further analysis of typical distortion
patterns and with finding analytical descriptions for them. Additionally, the known
dependence of the distortion on the distance between object and camera should be
considered and included in the modelling.